Mark Madsen says data science is half art and half science.

It's not the insight, but what you do with it. Analytic insights that result in no action are expensive trivia.

So says Mark Madsen, president of Third Nature, speaking during his keynote address, ‘Beer, diapers and correlation: A Tale of Ambiguity’, at the ITWeb Business Intelligence Summit 2017, held this morning at The Forum in Bryanston.

He says the beer and diapers story began around 1992. It claimed that men would buy beer when running to the shops to buy diapers. Wanting to capitalise on this phenomenon, the shop placed these incongruous items together and sales soared, he explains.

There is information gain that happens here, he says. "So why should a business do this sort of analysis? To increase co-purchase, cross promotion, price optimisation, inventory management, and refine marketing. Essentially lifting inventory to match expectation."

Beer and diapers

Madsen says he first read about the beer and diapers story in Chain Store Age Magazine in 1993. "I ran a test in 1994 in a grocery store, and specifically searched for this pattern. I was sceptical. In the US, alcohol laws differ from state to state. For example, in Pennsylvania you can't buy beer and diapers at the same time."

Did he find the correlation? "No. However, in 1995 I found the correlation. After that, every time I built a data warehouse at a retailer, I ran the analysis. I found a really high correlation in 1997. Some years there was a correlation, some years a weak correlation, and some years no correlation at all. So is it true? Yes. No. Maybe. I don't know.

"As an analyst, this really bothered me. I thought maybe this is just a random observation. Maybe I'm seeing a random purchase pattern, and am projecting my desire on to the outcome. It could be bad model fit."

So where did the story come from, he asks. "I found 11 400 academic papers and 14 000 books on analytics that mentioned the beer and diapers story. I pulled some of the quotes, and each story was slightly different. Remember that in academic research on data science and machine learning, PhD means BS piled higher and deeper. They are just as bad at fact checking as the rest of us."

The trail is muddy, says Madsen. It goes back to 1992, to Osco stores in the Midwest of the US. "It used 90 days of point of sale data from Osco drug stores, some 1.2 million baskets. Between 5pm and 7pm customers tended to co-purchase beer and diapers. So we have a correlation. A lady called Karen Heath did the analysis, although she never gets credit for it."

What did they find, and how? "They looked for correlation with baby products because they were high margin. They used SQL queries to find the relationships. They had a correlation between beer, diapers and time, but none with age, gender or day, and it didn't exploit the information by moving the products around. There were no loyalty programmes and no way of knowing the sex of the buyer, and no way to make attribution. It didn't distinguish between the actual affinities tested and our hypothesis. Somewhere fact blurred with folklore. In other words, don't let data get in the way of a good story."
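The time-of-day effect described above can be illustrated with a minimal sketch. It assumes a handful of made-up baskets (not the Osco data) and uses lift, a common co-purchase measure that is not necessarily the one the Osco analysts used. The names `BASKETS`, `lift` and `in_window` are hypothetical.

```python
from datetime import time

# Hypothetical toy baskets: (purchase time, items bought). Not real Osco data.
BASKETS = [
    (time(17, 30), {"beer", "diapers"}),
    (time(18, 10), {"beer", "diapers"}),
    (time(18, 45), {"milk"}),
    (time(17, 5), {"chips"}),
    (time(10, 0), {"beer"}),
    (time(11, 30), {"diapers", "milk"}),
    (time(14, 15), {"beer"}),
    (time(20, 0), {"diapers"}),
]


def lift(baskets, a, b):
    """Lift of items a and b: P(a and b) / (P(a) * P(b)).

    A value above 1 means the items appear together more often than
    chance would predict. Returns None when the ratio is undefined.
    """
    n = len(baskets)
    if n == 0:
        return None
    n_a = sum(1 for _, items in baskets if a in items)
    n_b = sum(1 for _, items in baskets if b in items)
    n_ab = sum(1 for _, items in baskets if a in items and b in items)
    if n_a == 0 or n_b == 0:
        return None
    return (n_ab / n) / ((n_a / n) * (n_b / n))


def in_window(baskets, start, end):
    """Keep only baskets bought at or after start and before end."""
    return [(t, items) for t, items in baskets if start <= t < end]


# Compare the whole day with the 5pm to 7pm window mentioned in the story.
evening = in_window(BASKETS, time(17, 0), time(19, 0))
overall_lift = lift(BASKETS, "beer", "diapers")
evening_lift = lift(evening, "beer", "diapers")

print(f"all day: {overall_lift:.2f}, 5pm to 7pm: {evening_lift:.2f}")
# prints "all day: 1.00, 5pm to 7pm: 2.00"
```

In this toy data the pairing looks like pure chance across the whole day and only shows up inside the evening window, which is one way the same question can yield different answers depending on how the data is sliced.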

Art and science

Madsen says data science is half art and half science. "Analytical cubism paints a picture in a way that is similar to how we see and remember a subject. Although cubism seems abstract, it tries to capture the truth, and it may be a more realistic way to paint a portrait.

"Take a 3D image of someone's face, flatten it out and put the image on to 2D. This gives a more realistic view, of all sides combined on to one canvas. It is a more realistic reflection of reality than a standard Dutch realist painting."

He adds that as with art at the turn of the last century, analytics, particularly how people think they should be used, can get stuck in absolutist thinking. "Models are like art, they abstract reality by capturing dominant features, the essence of a situation. Like cubist paintings. An analytic model is something that pulls together the dominant features in the data, and leaves out a lot of the data."

Madsen says the general idea is correlation, a relationship between two things, or statistically speaking, a quantity measuring the extent of interdependence between variable quantities.
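For two yes/no variables, such as whether a basket contains beer and whether it contains diapers, one standard way to put a number on that interdependence is the phi coefficient, which is the Pearson correlation applied to binary data. The sketch below is illustrative only: the function name and the counts are made up, and the article does not say which measure Madsen used.

```python
import math


def phi_coefficient(n11, n10, n01, n00):
    """Pearson correlation of two binary variables from a 2x2 count table.

    n11: baskets with both items      n10: first item only
    n01: second item only             n00: neither item
    Ranges from -1 to 1; 0 means no linear association.
    Returns None if any row or column total is zero.
    """
    row1, row0 = n11 + n10, n01 + n00
    col1, col0 = n11 + n01, n10 + n00
    denom = math.sqrt(row1 * row0 * col1 * col0)
    if denom == 0:
        return None
    return (n11 * n00 - n10 * n01) / denom


# Hypothetical counts for 100 baskets, invented for illustration.
weak = phi_coefficient(n11=30, n10=20, n01=20, n00=30)
none = phi_coefficient(n11=25, n10=25, n01=25, n00=25)

print(f"weak: {weak:.2f}, none: {none:.2f}")
# prints "weak: 0.20, none: 0.00"
```

A value like 0.2 is the kind of weak, positive relationship that can come and go from one sample to the next, in line with the mixed results Madsen describes.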

A fractured picture

So with beer and diapers, fragmentation and ambiguity were a problem. "There are mixed answers, and some correlations, but are you asking the right questions, and under what set of circumstances is it true?"

Enterprise reality is that causality is greater than correlation, he adds. "You wouldn't do anything without some idea of cause or context. Correlation isn't enough. So what is the explanation? What is the defining function linking the two together?"

This is where ‘analytical cubism’ comes in. It is called this because of its structure. "Dissection of the subject, viewpoint by viewpoint, resulting in a fragmentary image of multiple viewpoints and overlapping planes."

This is the same thing that is happening with the beer and diapers story, he says. "With this, the model is seeing pieces of a whole, because each of these models is independent with its own observations and data, like a snapshot that covers only the focal point of your eye. There are multiple pieces that are not independent, but dependent on a larger frame of reference."

When they are lined up together, the larger scene emerges in the same way that your brain puts together a scene by focusing on individual pieces and assembling the picture from them as your focus moves from place to place, concludes Madsen.