A work in progress: Toward data swamps, lakes or oceans?

Are the days of data warehouses numbered, or simply morphing into an exciting new model for data platforms?
By Mervyn Mooi, Director of Knowledge Integration Dynamics (KID), which represents the ICT services arm of the Thesele Group.
Johannesburg, 13 Apr 2021

There has been widespread talk among industry pundits and vendors that the traditional data warehouse is set to be superseded by data lakes, with new and emerging tools that ingest and store any hoard of “big” data, allow for analytics on the fly as needed, and promise speed of delivery to the consumer or analyst.

Compelling as this may sound, the world is still a long way from this ideal, and the traditional data warehouse is not dead.

Data lakes, holding vast amounts of raw and unstructured data in their native formats, can prove to be more costly and complex than many organisations can handle.

Unless an organisation has a massive appetite for big data analytics, “distilled” business cases, the budget, and the resources to support it, seeking actionable insights from data lakes can be akin to trying to “boil the ocean”.

Several local organisations that bought into the dream of saving time and money through data laking and big data analytics have since found that it can quickly become an unnecessary, messy and expensive swamp.

On the other hand, most organisations – including major enterprises – still depend heavily on a traditional data warehouse, largely because its structured data has been through the hoops of validation and quality control, and can be trusted.

Although there could be some latency, the data in the traditional data warehouse is often all organisations need and may contain decades’ worth of business-critical IP. Tried and trusted, traditional data warehouses are not without their flaws, however. Among these are limited scope for handling big data volume, variety and velocity, for prescriptive analytics, and for seamless integration with data lakes.

Traditional methods also require data to be duplicated across operational / staging / data warehouse stores, analytical (BI) environments, and backup and recovery environments.

As the world of data laking evolves and moves to integrate with data warehousing, organisations need to aim for the sweet spot between the flexibility and ability to address evolving business questions that data lakes offer, and the trusted data, structure and value the traditional data warehouse can deliver.

Currently, enterprises with a large appetite for big data analytics should be employing both. But until data lakes can offer a consistent, democratised and structured approach to deriving metrics and analyses, and to storing them in a structured, useful way, they cannot fully deliver on their promise and may not be the right approach for budget-constrained businesses with limited need or capacity for big data analytics.

Meanwhile, traditional data warehouses should be morphing to become future-proof. Indeed, many organisations are now starting to refactor their data warehouses and are bringing their decades of IP into the mainstream data sciences / advanced analytics environment. The two worlds will likely merge in years to come, bringing the best of both in a manner that allows for predictive analytics that can be trusted.

Further into the future, we may even achieve an environment best described as a “data ocean” in which only a single version of all data from transactional and production systems and records resides, with a virtual layer in place to allow for real-time analytics without tarnishing the original data. This environment could provide a “one-stop trusted data store” with no replication of data.

This data ocean would be deployed on a fail-safe “no data loss” data platform, which modern-day technologies already provide. The compute power and bandwidth of such a platform would be astronomical compared to present-day capability, and would allow for “on the fly” data discovery, curation, integration, derivation, reporting and analysis via a virtual access and presentation model.

The data ocean would be the logical boundary of all and any internal / private data that an organisation generates, as well as any external data it is entitled or allowed to mine or explore on the WWW. There would be no need for a backup site, nor to duplicate data into an ODS (operational data store), lake, staging area, data warehouse or data mart – only key derivatives and insights (results) would be persisted locally by the consumers.
