The big data exploration: part two

It is important to choose an ecosystem that suits the entire company's current and planned strategic objectives.

By Julian Thomas, Principal consultant at PBT Group
Johannesburg, 18 Jan 2018

In my previous Industry Insight, The big data exploration, I defined some key points I believe any company should keep in mind when embarking on a big data initiative. For me, the goal should be to retain a business-driven focus - as big data cannot, and should not, be an IT-led initiative.

Companies must focus on and prioritise business use cases that cannot be accomplished in the traditional paradigm. Of course, this will also mean keeping a flexible, adaptable mindset - or, as I like to term it, an 'explorer's mindset'. An explorer must be open to all the opportunities big data represents, while being mindful of the challenges these initiatives will bring.

In this Industry Insight, I examine the factors all businesses should consider when embarking on a big data journey.

Expedition time

Imagine this: a young explorer has just secured his (or her) funding for an expedition into previously uncharted lands. In a smoke-filled room, the explorer and companions plan out the duration, direction and possible destination of their journey, while deciding what supplies and equipment to pack. The success of their journey starts here: how well they plan the journey and how well they prepare for any possible challenges will determine their overall success.

The temptation these early explorers had to avoid was packing too much equipment and too many supplies. The less luggage they carried, the faster they could travel - and increasing travel speed while reducing the cost of supplies and equipment had a direct impact on their eventual profitability. Travelling too light could, however, lead to disastrous consequences: being caught in the mountains in an early winter without sufficient supplies, or facing off against a dangerous animal without the weapons required for defence.

The same can be said for the modern world, where people are often under pressure to "pack light". Companies feel pressured to keep costs down and the implementation simple. As a result, they often don't have the right support to look ahead and cater for all likely eventualities. The need to remain agile is stressed, as well as not (unnecessarily) adding to the scope of an initiative. However, failing to take the bigger picture into account at the start of the big data journey will ultimately limit the ability to respond rapidly and in an agile manner to future challenges.

World of confusion

Today, there is a real risk that companies will grow their big data footprint erratically - one department, team or individual requirement at a time. Many teams and departments are embarking on big data initiatives, which is great, but often in independent silos. The result is a confusing mishmash of platform and product selections across the company, with multiple ecosystems and products in use. All of this increases the cost and complexity of maintenance, support and future development.

As a data and solution architect, I can comfortably state that one of my main goals is to consolidate data and solutions. The fewer moving parts an organisation has, the better it operates. The less its data is spread across disparate products and ecosystems, the more efficiently it will perform.

And, because of this, I have to say it is not ideal to see this proliferation of similar ecosystems and products emerging within companies. So, what can they do to make sure they don't fall into this trap?

Many mature, robust big data ecosystems have emerged, both on-premises and cloud-based, and these should cover all (or most) of the functionality a corporate might require in the context of big data: distributed file storage, resource management, SQL and NoSQL databases, data integration programming languages, machine learning tools and languages, distributed messaging and computation engines, governance and security, and metadata and workflow management tools.

I would encourage companies to evaluate the various ecosystems, as well as their requirements, in terms of on-premises versus cloud-based environments. Choose an ecosystem that suits the entire company's current and planned strategic objectives. Consider functionality, cost, ease of maintenance, and availability of local support.

Look deeper into the chosen ecosystem and adopt a flexible standard mode of operation for the entire company to adhere to. A bare minimum would be a common data integration framework, providing a central, standard mechanism for logging, alerting, error handling and escalation.
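To make the idea of a common data integration framework concrete, here is a minimal sketch of what such a standard mechanism might look like. The names (`integration_job`, `send_alert`) and the escalation rule are illustrative assumptions, not a prescribed design: the point is that every job in the company wraps its work in the same logging, error-handling and alerting behaviour, rather than each team inventing its own.

```python
import logging
from functools import wraps

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("integration")


def send_alert(job_name, error, escalate=False):
    # Hypothetical alert hook: in practice this would notify the
    # support team, escalating to on-call staff for critical jobs.
    level = "ESCALATED" if escalate else "ALERT"
    log.error("%s: job '%s' failed: %s", level, job_name, error)


def integration_job(name, critical=False):
    """Wrap any ingestion or transformation step with the framework's
    standard logging, error handling and escalation behaviour."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            log.info("Starting job '%s'", name)
            try:
                result = fn(*args, **kwargs)
                log.info("Finished job '%s'", name)
                return result
            except Exception as exc:
                # Central error handling: alert, then re-raise so the
                # scheduler still sees the failure.
                send_alert(name, exc, escalate=critical)
                raise
        return wrapper
    return decorator


@integration_job("load_customer_feed", critical=True)
def load_customer_feed(rows):
    if not rows:
        raise ValueError("empty feed")
    return len(rows)
```

Because every job goes through the same wrapper, operational concerns are handled once, centrally, and a new team adopting the framework inherits consistent logging and escalation for free.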

It is important to note that even within a single ecosystem, there are still many ways to achieve the company's objectives. Within the ecosystem, companies are further encouraged to standardise their approach to data acquisition, staging, persistence, machine learning and exploitation, and to establish a set of common design and implementation patterns. A shared ecosystem enables re-use and sharing of IP, as well as a shared pool of skills and experience across the entire company.

At this point, I would also stress the agile principle of inclusivity. The entire company must support the choice of ecosystem, so include all the relevant stakeholders in the evaluation and selection process.

The journey ahead is long, and a company cannot predict everything that will happen. With a little bit of forethought, however, it is possible to lay down a stable foundation for a big data footprint. This foundation will allow the company to grow organically over time across the entire organisation, and most importantly, allow it to mobilise and respond rapidly to new and surprising challenges.