The big data exploration: part three

Big data, the cloud, open source and machine learning APIs are the big four enablers of modern data science.

Julian Thomas.

In my previous Industry Insight, The big data exploration: part two, I continued to emphasise the importance of partnering with business and being flexible. I also stressed it is not only about technology, but finding solutions to help enable innovative and disruptive business ideas. I then explained the importance of working consistently within a coherent big data ecosystem.

If a company has achieved all of this, it is slowly (or, hopefully, quickly) starting to gather all of the available data and is now ready to mine.

Having reached this point, the business might once again hit a road bump in this journey, and may be confused as to how to proceed. The world of analytics is changing even faster than the world of big data, and there are, once again, a great many options and opinions on how to proceed and mine the data accurately. Considering this, what should a business's strategy be? And how should the team operate?

In my final Industry Insight in this series, I will describe what I believe to be the big four enablers of modern data science.

It is no wonder the first enabler is big data. Big data gives the analytics world access to vast new types of data that were previously unavailable to analysts: for example, call recordings, Web traffic behaviour, video, the Internet of things and social sentiment. With this, the depth, breadth and scope of the analytics that can be achieved are greater than ever before.

The second enabler is almost as obvious: the cloud. It can provide data scientists with scalable and flexible platforms, cloud-based sources of data, and more. The ability to quickly ramp up, run a complex series of algorithms and then ramp down again (until the next time) is a priceless enabler of data science.

Taking the plunge

The third big enabler is open source. Think of languages like R and Python, or platforms such as Spark. The open source community represents the biggest growth in capability in the data science field. It is where the majority of the cutting-edge development is taking place. If a business wants to be relevant, competitive and up to date with what is available in the industry, it must be prepared to immerse itself in the open source community.

However, there are concerns about the risks, stability and support involved in engaging with open source. What the past decade demonstrates is that many open source languages and platforms have matured to such an extent that they are completely comparable to their enterprise-grade equivalents. What is apparent is that when the entire open source community mobilises, it often provides solutions to problems far faster. The reality is that vendors are also seeing this and, as a result, are building solutions on the back of open source platforms and languages.

A further consideration is that the new, emerging data scientists want to work in this open source world. Much as the previous generation studied data science on platforms such as SAS and SPSS, this generation is emerging with R and Python skills, closely aligned with the open source community, and they want opportunities to explore this in business.

Under construction

Finally, the last of the big four enablers is the rise of machine learning application programming interfaces (APIs). The future of data science is not laboriously creating and training one's own analytic models from scratch, but rather using models that have been created and trained by companies such as Google and AWS. These companies are investing massively in building various types of predictive models and providing them as off-the-shelf APIs. Think Google's Speech, Vision, Natural Language and Translate APIs.

These companies have the ability to build generic models that can be trained by their entire extended user base. Remember the street signs Google made users identify? Think how accurate a model will be if more than a billion people are training it. What this means is that, more and more, the models a business needs will already exist in the form of an API that users can simply plug into, saving weeks of effort.
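To make "simply plug into" a little more concrete, here is a minimal sketch in Python of what using a pre-trained image-labelling model might look like from the caller's side. It assumes the JSON request and response shapes publicly documented for the label detection feature of Google's Cloud Vision REST API (the images:annotate method); those shapes should be checked against the current documentation before use. The helper names build_label_request and top_labels, and the 0.8 score threshold, are hypothetical choices made for illustration. No network call is made here.

```python
import base64


def build_label_request(image_bytes: bytes, max_results: int = 5) -> dict:
    """Build a JSON body asking a hosted vision model to label one image.

    Assumption: the body follows the documented shape of the Cloud Vision
    images:annotate method; verify against the current docs.
    """
    return {
        "requests": [
            {
                # Images are sent inline as base64-encoded text.
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [{"type": "LABEL_DETECTION", "maxResults": max_results}],
            }
        ]
    }


def top_labels(response: dict, min_score: float = 0.8) -> list:
    """Return the label descriptions whose score meets the threshold.

    The 0.8 default is an arbitrary illustrative cut-off.
    """
    annotations = response["responses"][0].get("labelAnnotations", [])
    return [a["description"] for a in annotations if a.get("score", 0.0) >= min_score]
```

The point of the sketch is how little is left for the business to build: no training data, no model code, just a request and a filter over the answer. In practice the request would be sent with an authenticated HTTP client or the vendor's own client library.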

So, how does all of this affect a business's engagement model? These big four enablers provide a powerful confluence of capability, agility and value proposition that is simply too good to ignore. Businesses that are not cognisant of these big four enablers, and do not take advantage of them in their data analytics strategy, must reconsider.

The implication of these enablers is a world where human, machine, software and mathematical resources combine to deploy analytic solutions faster than ever before. The end result is increased competitive advantage, which is what every business loves to hear.

Julian Thomas
Principal consultant at PBT Group

Julian Thomas is principal consultant at PBT Group, specialising in delivering solutions in data warehousing, business intelligence, master data management and data quality control. In addition, he assists clients in defining strategies for the implementation of business intelligence competency centres, and implementation roadmaps for a wide range of information management solutions. Thomas has spent most of his career as a consultant in South Africa, and has implemented information management solutions across the continent, using a wide range of technologies. His experience in the industry has convinced him of the importance of hybrid disciplines, in both solution delivery and development. In addition, he has learned the value of robust and flexible ETL frameworks, and has successfully built and implemented complementary frameworks across multiple technologies.