
The adventures of a data scientist


Cape Town, 15 Apr 2014

Harvard Business Review called data scientist the "sexiest" job of the 21st century, a label that has generally met vehement opposition. Many adjectives would be appropriate, such as "fascinating", "intriguing", "interesting", even "hot", but I dare say that asking someone to watch you process data is far from stimulating.

The term "data scientist" was coined in 2008 by Jeff Hammerbacher and DJ Patil, then leading the data teams at Facebook and LinkedIn respectively (Hammerbacher later co-founded Cloudera), who decided that people who work in the "big data" space should be called something other than researchers, analysts or data magicians.

So what is the aim of the data scientist?

Data science is a business function underpinned by mathematics and enabled by technology. It is the banker who wants to know the best criteria for extending a line of credit; it is the marketer who needs to know which clients to include in a campaign; and it is the service provider who wants to know how to retain customers. The real push towards data analytics came in 2008 in the wake of the credit crisis, which forced businesses to make decisions based on mathematical rigour rather than intuitive business sense.

The data scientist does not belong to IT or technology. Yes, they use software packages, but how many jobs today could be done without some software package or other? So when engaging with the business, it is often more important that the data scientist speaks to the business user. Here lies the first distinction. The data scientist has to understand the business challenges and goals from the business user's perspective. They may start out a novice in the industry, but by the end of a project they need to understand the subtle nuances that make that business unique. Anything less would mean a break in communication, and if you don't understand the business user, it is more than likely that they don't understand you. Visualisations help in communication: it is easier to see the big picture when it is literally a big picture or dashboard. If the business has a reporting tool, use it; if not, recommend one that suits its needs. Remember that your goal is to communicate business insights.

The second distinct skill is analytical, mathematical and statistical, with a focus on predictive modelling. Predictive modelling is its own subtle art, seen either as a sangoma throwing the bones to tell you that you will die in the next two to 50 years, or as an artificially intelligent computer predicting the success of a medical procedure. Modelling is the heart and soul of any data science solution or, more factually, the science half of data science, and there are various modelling techniques with varying degrees of predictive power. Some, like linear regression and decision trees, are easily understood; others, like neural networks, are quite esoteric. This has led many experts to conclude that it is easier for someone with a background in the mathematical sciences to learn the computational elements of data science than for a computer scientist to learn the requisite statistics.
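To illustrate why linear regression counts among the easily understood techniques, here is a minimal sketch of an ordinary least-squares fit in plain Python. The function name `fit_line` and the figures for customer tenure and monthly spend are invented purely for illustration; they do not come from any real client.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-style and variance-style sums (unnormalised).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up illustrative data: months as a customer vs. monthly spend.
months = [1, 2, 3, 4, 5]
spend = [110, 120, 130, 140, 150]

slope, intercept = fit_line(months, spend)
print(f"spend = {slope:.1f} * months + {intercept:.1f}")
```

The whole model is two numbers a business user can read directly: how much spend changes per month of tenure, and where it starts. A neural network offers no such plain reading, which is part of what makes it esoteric.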

Finally, the component everyone seems to be fixated on: big data. The unbiased truth is that most companies in South Africa may never need to go down the Hadoop route. The data sitting in your company's database may be significantly large and may require hours to process, but there are two questions you need to ask. Firstly, how much of that data is useful and actually going to be used? Secondly, how soon do you need the result, and how often would you need to rerun the process? In practice, unless the requirement is real-time or near real-time processing of millions of records, expensive and complicated big data solutions are not usually necessary. The requirement still exists for the data scientist to be an expert in data management and manipulation. A wide-ranging knowledge of the various data management tools is a prerequisite. To clarify, that doesn't mean being an expert in every software variant in the data management space. It is more about having a toolset you are comfortable with that covers everything from SQL queries to JAQL programming on a Hadoop cluster, and having a cursory knowledge of which tools, proprietary or open source, can be substituted for them.

Big data is not a silver bullet, but it does have its applications. CERN's Large Hadron Collider produces 35 petabytes of data a year and was projected to require 300 000 cores by 2014. Measured against that, does it mean there will never be a local need for a big data solution? Where is the cut-off? The line in the sand is 5TB of unstructured data or 7.5-10TB of structured data that cannot be reduced any further and where traditional methods have been exhausted. At that point, give your data scientist a wide berth and watch them enter big data "nerdvana".
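The rule of thumb above can be written down as a small check. This is only a sketch under stated assumptions: the function and parameter names are hypothetical, and the structured-data cut-off is taken as 7.5TB, the lower end of the 7.5-10TB range, which is a judgment call rather than a fixed figure.

```python
# Cut-offs from the rule of thumb in the text, in terabytes.
UNSTRUCTURED_CUTOFF_TB = 5.0
# Assumption: use the lower end of the 7.5-10TB range for structured data.
STRUCTURED_CUTOFF_TB = 7.5


def warrants_big_data(size_tb, structured, reducible, traditional_exhausted):
    """Return True if the data set crosses the line in the sand.

    size_tb: size of the data after any reduction already applied
    structured: True for structured data, False for unstructured
    reducible: True if the data can still be reduced further
    traditional_exhausted: True if traditional methods have been tried
    """
    # Both preconditions must hold before size is even considered.
    if reducible or not traditional_exhausted:
        return False
    cutoff = STRUCTURED_CUTOFF_TB if structured else UNSTRUCTURED_CUTOFF_TB
    return size_tb >= cutoff


# 6TB of unstructured data, fully reduced, traditional methods exhausted.
print(warrants_big_data(6.0, structured=False, reducible=False,
                        traditional_exhausted=True))
```

Note the order of the checks: size only matters once the data cannot be trimmed any further and the conventional toolset has genuinely run out of road.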


Olrac SPS South Africa

Olrac SPS South Africa, previously SPSS-SA, is an award-winning predictive business intelligence (PBI) company. The company's team of highly experienced and internationally qualified analysts specialises in the design, development and implementation of state-of-the-art software solutions in this field (aka predictive analytics).

Its market penetration is further solidified by offering an acclaimed suite of analytical products including SPSS, Modeler and Cognos, a superior level of software support and intensive SPSS training courses.
