A Day in the life of Citizen Data Scientistovich (*)

By Allyson Towle
Johannesburg, 11 Jan 2016

ITWeb Business Intelligence Summit 2016

Hear detailed insight into data science from a South African perspective, presented by Jasper Horrell, SKA Project, and Megan Yates and James Turton, Ixio Analytics, at the Business Intelligence Summit on 1 and 2 March 2016.

Data science is a relatively new concept in South Africa. ITWeb's 2016 Business Intelligence Summit turns the spotlight on the skills and supporting infrastructure needed to accommodate a data scientist, and on why you need to pay attention.

Yigit Karabag's semi-fictional story, while not conclusive, is intended to show how analytical data exploration can be placed into the hands of the 'citizen data scientist'. This would help build a data-driven business culture, empowering the traditional analysts of yesterday to become powerful modellers. Karabag is head of information management and analytics practice - Middle East, Turkey and Africa at the SAS Institute. The story follows below.

An introduction to next generation BI

Craig is a business analyst who has worked for an auto-insurance company for the past eight years. Although he is curious, business-savvy and has spent a good deal of time teaching himself about analytics, he has never used the advanced tools that the actuaries and fraud managers in the organisation use.

One day he gets an internal memo inviting him to attend a session to "get introduced to the next generation BI solution of the company". Curious, he immediately signs up for the event.

Turns out it is a very short session with a few flashes of data visualisation screens and some outrageous statements such as "you can now build a predictive model without requiring the statistical knowledge behind it" and "how about analytical segmentation of customers with a click of a button?"

Craig's curiosity is far from satisfied. He already has an assignment to analyse the data from a recently installed car telematics(**) system and he wants to find out if this new BI solution will make his life easier or not.

After a few inquiries, and two days later, he finds himself sitting in front of the "next generation BI solution" of his company. On the home screen, he can see some of the things he could start doing, such as "Explore Data", "Design a Report", "Prepare Data" and "Build an Analytical Model".

Prepare

Instinctively, he selects the "Prepare Data" option, and within a few minutes he can happily see the car telematics data set he uploaded in front of him, ready to be "prepared". He is a veteran data analyst, so it doesn't take much effort for him to create a few calculated fields and to join the telematics data with driver demographic information from the customer database to enrich the data set.
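For readers who want to see what such a preparation step amounts to, here is a minimal sketch in plain Python. It assumes nothing about the BI solution in the story: the field names (driver_id, max_speed_kph, distance_km, duration_h, age, gender), the values and the enrich helper are all hypothetical and exist only to illustrate a join plus one calculated field.

```python
# Hypothetical telematics trips; field names and values are illustrative only.
telematics = [
    {"driver_id": 1, "max_speed_kph": 132, "distance_km": 15.0, "duration_h": 0.25},
    {"driver_id": 2, "max_speed_kph": 88, "distance_km": 30.0, "duration_h": 0.5},
    {"driver_id": 1, "max_speed_kph": 95, "distance_km": 45.0, "duration_h": 0.75},
]

# Hypothetical extract from the customer database, keyed by driver.
demographics = {
    1: {"age": 34, "gender": "F"},
    2: {"age": 61, "gender": "M"},
}


def enrich(trips, drivers):
    """Left-join each trip to its driver's demographics and add a calculated field."""
    enriched = []
    for trip in trips:
        row = dict(trip)
        # Join step: copy the driver's attributes onto the trip (nothing if unknown).
        row.update(drivers.get(trip["driver_id"], {}))
        # Calculated field: average speed over the trip.
        row["avg_speed_kph"] = row["distance_km"] / row["duration_h"]
        enriched.append(row)
    return enriched


rows = enrich(telematics, demographics)
for row in rows:
    print(row)
```

A drag-and-drop tool would presumably hide these details, but the result is the same kind of table: one row per trip, widened with driver attributes and derived columns.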

Explore

Yigit Karabag Head of Information Management & Analytics Practice - Middle East, Turkey & Africa at SAS Institute.

Once pleased with how the data looks, he starts exploring his data set. He already has some questions. Some are part of the official assignment, some are not. Officially, the aim is to determine whether the bad driving behaviour of policy holders can be classified and whether this classification is related to the demographics of the drivers. Unofficially, Craig wants to test the prejudices about female and older drivers against real-life driving data to see if these theories hold any merit.

As the first step of his exploration, he wants to determine the most predominant driving "events" that are related to bad driving. To do this, he quickly visualises the fast accelerations, sharp braking and speed violations in a tree map, then creates a custom field to describe the bad driving behaviours in a standard way (e.g. any speed over 120 kph is categorised as a "speed violation").
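A custom field of this kind is essentially a set of threshold rules. The sketch below shows one possible form in plain Python. Only the 120 kph speed rule comes from the story; the acceleration and braking thresholds, the field names and the categorise function are assumptions made up for illustration.

```python
SPEED_LIMIT_KPH = 120  # threshold taken from the story's example

# Hypothetical thresholds for the other event types (not from the story).
HARSH_ACCEL_MS2 = 3.0
HARSH_BRAKE_MS2 = -3.5


def categorise(reading):
    """Return the bad-driving labels that apply to one telematics reading."""
    labels = []
    if reading["speed_kph"] > SPEED_LIMIT_KPH:
        labels.append("speed violation")
    if reading["accel_ms2"] >= HARSH_ACCEL_MS2:
        labels.append("fast acceleration")
    if reading["accel_ms2"] <= HARSH_BRAKE_MS2:
        labels.append("sharp braking")
    return labels


# Hypothetical readings, one dictionary per sensor sample.
readings = [
    {"driver_id": 1, "speed_kph": 132, "accel_ms2": 0.4},
    {"driver_id": 1, "speed_kph": 60, "accel_ms2": -4.1},
    {"driver_id": 2, "speed_kph": 120, "accel_ms2": 3.2},
]

for reading in readings:
    print(reading["driver_id"], categorise(reading))
```

Note that the third reading sits exactly at 120 kph and is therefore not flagged for speed, since the rule in the story is "over 120 kph".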

Categorise

Once he has categorised bad driving behaviour, he creates a count of "incidents" per driver with a few clicks and is ready to try segmenting the drivers, based on their driving habits, by performing a cluster analysis. Since Craig is not trained in statistics and does not know clustering algorithms such as K-Means, he only needs to press a button labelled "Clustering Analysis" and the next generation BI solution of the company takes care of all those details for him. As soon as he hits the button, he is asked which variables in his data set to consider while clustering drivers. Craig chooses his custom bad driving events field, as well as other demographic details of the drivers, and the system instantly generates four segments of drivers for him.
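What might sit behind such a button? The story names K-Means, so the sketch below shows a bare-bones version of that algorithm in plain Python, preceded by a per-driver incident count. It is an illustration under stated assumptions, not the product's implementation: the driver IDs, the counts, the choice of two features and two clusters, and the shortcut of seeding the centroids with the first k points are all hypothetical. A real tool would also scale mixed variables such as age and event counts before clustering.

```python
from collections import Counter

# Counting incidents per driver from hypothetical (driver_id, label) events.
events = [("D1", "speed violation"), ("D2", "fast acceleration"), ("D1", "sharp braking")]
incidents = Counter(driver_id for driver_id, _ in events)


def kmeans(points, k, iterations=20):
    """Plain k-means on tuples of numbers; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids


# Hypothetical per-driver features: (speed violations, harsh acceleration events).
features = {
    "D1": (1, 0), "D2": (9, 8), "D3": (0, 1),
    "D4": (10, 9), "D5": (2, 1), "D6": (8, 10),
}
labels, centroids = kmeans(list(features.values()), k=2)
segments = dict(zip(features, labels))
print(segments)
```

On this toy data the calm drivers (D1, D3, D5) land in one segment and the event-heavy drivers (D2, D4, D6) in the other. Naming the segments, as Craig does next, remains a human judgement.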

Review and Segment

After reviewing the segments briefly, Craig labels them as:

Segment 1 - Mostly good drivers, but with several speeding and harsh acceleration events.

Segment 2 - Reckless drivers with harsh braking and cornering.

Segment 3 - Best drivers with only a few harsh acceleration events.

Segment 4 - Most likely "Hot Hatch" drivers, with a very high number of harsh acceleration events.

Build a model

Satisfied with the way drivers are segmented, Craig now wants to build an analytical model that identifies the strongest predictors of bad driving behaviour. Since he has only a high-level understanding of the different predictive modelling techniques, he decides to build multiple models on the same data and let the next generation BI solution decide which one performs best.

After a few clicks and a few minutes, he has a champion model in front of him that has been automatically evaluated and verified. It tells him that all the prejudices against the gender or the age of the drivers are absolutely wrong and have no scientific basis, and that the power-to-weight ratio of the vehicle plays a much more important role in erratic driving behaviour than any demographic attribute.
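The "champion model" idea, in which several candidates are fitted on the same data and the best one on held-out records wins, can be sketched in a few lines of plain Python. Everything here is an assumption for illustration: the candidate models are deliberately trivial (a majority-class baseline and one-variable threshold rules), the function names are invented, and the data is synthetic, constructed to echo the story's outcome. It proves nothing about real drivers.

```python
def majority_model(train):
    """Baseline: always predict the most common label in the training data."""
    labels = [y for _, y in train]
    guess = max(set(labels), key=labels.count)
    return lambda x: guess


def stump_model(train, feature):
    """One-variable rule: predict 1 when the feature exceeds the best training cut."""
    best_cut, best_hits = None, -1
    for cut in sorted({x[feature] for x, _ in train}):
        hits = sum((x[feature] > cut) == bool(y) for x, y in train)
        if hits > best_hits:
            best_cut, best_hits = cut, hits
    return lambda x: int(x[feature] > best_cut)


def accuracy(model, rows):
    """Share of rows the model labels correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)


def champion(candidates, train, holdout):
    """Fit every candidate on train; the best holdout accuracy wins."""
    scores = {name: accuracy(build(train), holdout) for name, build in candidates.items()}
    return max(scores, key=scores.get), scores


# Synthetic drivers: label 1 means erratic driving. Values are made up.
train = [
    ({"age": 25, "power_to_weight": 60}, 0),
    ({"age": 52, "power_to_weight": 150}, 1),
    ({"age": 47, "power_to_weight": 70}, 0),
    ({"age": 23, "power_to_weight": 140}, 1),
    ({"age": 61, "power_to_weight": 65}, 0),
    ({"age": 35, "power_to_weight": 160}, 1),
]
holdout = [
    ({"age": 30, "power_to_weight": 155}, 1),
    ({"age": 58, "power_to_weight": 62}, 0),
    ({"age": 41, "power_to_weight": 145}, 1),
    ({"age": 27, "power_to_weight": 68}, 0),
]

candidates = {
    "baseline": majority_model,
    "age rule": lambda t: stump_model(t, "age"),
    "power-to-weight rule": lambda t: stump_model(t, "power_to_weight"),
}
winner, scores = champion(candidates, train, holdout)
print(winner, scores)
```

Scoring on held-out records rather than on the training data is the important design choice: it is what stops the most flexible candidate from winning simply by memorising what it has already seen.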

What did we learn?

The key takeaway for Craig, the business analyst, was that he could quickly upload and access his business data; prepare and enrich the data; analyse millions of records and create custom categories and hierarchies; build analytical segmentation models; and build, validate and compare predictive models, all without the assistance of IT or a statistician.

(*) My humble tribute to the great book by Aleksandr Solzhenitsyn, "One Day in the Life of Ivan Denisovich".

(**) Telematics as an interdisciplinary field encompasses telecommunications, vehicular technologies, road transportation, road safety, electrical engineering (sensors, instrumentation, wireless communications, etc.), and computer science (multimedia, Internet, etc.).