Decoding data lakes

The data lake metaphor emerged because ‘lakes’ are a great concept to explain one of the basic principles of big data – the need to collect all data, structured, semi-structured and unstructured, in its original format.

Johannesburg, 09 Aug 2020

Those familiar with the business intelligence space will know the role of a data warehouse. A data warehouse accumulates data from multiple sources, with the objective of providing analytics that drive business decisions. In today’s world of big data, we have started hearing a lot about data lakes. A data lake, like a data warehouse, is a storage repository for vast amounts of data. So, then, is a data lake just a different implementation of the data warehouse? In fact, the two are quite different.

The term “data lake” was coined by James Dixon (founder of Pentaho). According to Dixon, data warehousing led to information silos, which data lakes could overcome.

The data lake metaphor emerged because ‘lakes’ are a great concept to explain one of the basic principles of big data – the need to collect all data, structured, semi-structured and unstructured, in its original format. By contrast, a data warehouse is compared to bottled drinking water: cleansed, structured, formatted and ready for consumption.

Data lakes vs data warehouse

While a data warehouse largely contains structured data, a data lake holds data that could be structured, semi-structured or unstructured. A data warehouse transforms the original transactional data into a purpose-built data store. The multi-dimensional schema is created first, and relevant, cleansed data is then written to it. This is often referred to as “schema on write”. In contrast, a data lake uses a flat structure to store the data in its native format. The data and schema requirements are determined only when the data is queried, and the relevant data is identified and consumed at read time. That is why this approach is referred to as “schema on read”. Since it involves only extracting and loading initially (from the sources), and transforming later as necessary, setting up a data lake involves reduced initial cost and effort.
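The difference between the two approaches can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the sample events, the readings table and both function names are hypothetical, and an in-memory SQLite database stands in for a warehouse.

```python
import json
import sqlite3

# Hypothetical raw events, as they might arrive from different sources.
RAW_EVENTS = [
    '{"device": "sensor-1", "temp_c": 21.5, "ts": "2020-08-09T10:00:00"}',
    '{"device": "sensor-2", "humidity": 40, "ts": "2020-08-09T10:01:00"}',
    '{"user": "alice", "action": "login"}',
]


def load_schema_on_write(raw_events):
    """Warehouse style: define the table first, then write only conforming rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE readings ("
        "device TEXT NOT NULL, temp_c REAL NOT NULL, ts TEXT NOT NULL)"
    )
    for line in raw_events:
        record = json.loads(line)
        # Records that do not fit the predefined schema are dropped at load time.
        if {"device", "temp_c", "ts"} <= record.keys():
            conn.execute(
                "INSERT INTO readings VALUES (?, ?, ?)",
                (record["device"], record["temp_c"], record["ts"]),
            )
    return conn


def query_schema_on_read(raw_events, fields):
    """Lake style: keep raw lines untouched and apply a schema only when querying."""
    rows = []
    for line in raw_events:
        record = json.loads(line)
        if all(field in record for field in fields):
            rows.append(tuple(record[field] for field in fields))
    return rows


warehouse = load_schema_on_write(RAW_EVENTS)
warehouse_rows = warehouse.execute("SELECT device, temp_c FROM readings").fetchall()

# The same raw data can answer questions nobody planned for at load time.
humidity_rows = query_schema_on_read(RAW_EVENTS, ["device", "humidity"])
login_rows = query_schema_on_read(RAW_EVENTS, ["user", "action"])
```

In the warehouse path, the humidity and login events are discarded because they do not match the table defined up front. In the lake path, they remain available and can be read with whatever schema a later question requires.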

Use of data lakes

With the advent of data science, data scientists are working with large data sets comprising structured, semi-structured and unstructured data. However, it is not immediately obvious what data will be required in the long run. In a bid to uncover patterns, data scientists need to explore vast pools of data in which the schema and data requirements are not defined until the data is queried. It is this need for flexibility that data lakes can meet effectively.

Data lakes help address a variety of requirements within an organisation. A data lake is a great solution for storing IoT data alongside structured organisational data. A data lake can be an information source for a front-end application. At the same time, it can also be used as a staging area for data that is eventually fed to a data warehouse. A data lake can be very valuable in supporting an active archiving strategy as well. Concepts of “personal data lakes” for storing, analysing and querying personal data are also being promoted. The healthcare industry can, for example, use data lakes to aggregate inputs from diverse sources and manage treatments in real time.

Setting up data lakes

Companies that are endeavouring to set up data lakes and bring big data to analysis should be careful not to be constrained by technologies that cater only to specific data types or application scenarios. Such technologies eventually create multiple “data swamps” that require jumping across multiple technologies and datasets for analysis. One must also recognise that setting up a data lake is not where the challenge lies. The challenge is in taking advantage of the opportunities it presents. Recent successes in data science are helping to overcome cynicism about whether those opportunities can be realised.

Setting up a data lake does require planning. We need to ensure security, manage master data and metadata, encrypt data, schedule the extractions and loads, and so on. While a data lake does not require structured schemas to be set up upfront, data still needs to be organised for optimal retrieval. At the same time, a data lake cannot be implemented in isolation. The effort should involve the business leaders and users who will consume the data in various forms, as well as the data scientists who will enable such consumption.
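One common way to organise raw data for retrieval, offered here only as an illustrative sketch, is to store files under paths partitioned by source and date, so that a query can skip everything outside the range it needs. The folder layout, the lake/raw root and both helper functions below are assumptions made for the example, not a prescribed standard.

```python
from datetime import date
from pathlib import PurePosixPath


def partition_path(root, source, day, filename):
    """Build a storage path partitioned by source system and calendar date."""
    return (
        PurePosixPath(root)
        / f"source={source}"
        / f"year={day.year:04d}"
        / f"month={day.month:02d}"
        / f"day={day.day:02d}"
        / filename
    )


def paths_for_month(paths, source, year, month):
    """Select only the files in one source/month partition, using the path alone."""
    wanted = (f"source={source}", f"year={year:04d}", f"month={month:02d}")
    return [p for p in paths if all(part in p.parts for part in wanted)]


# Hypothetical files landing in the lake from two sources.
stored = [
    partition_path("lake/raw", "iot", date(2020, 8, 9), "events.json"),
    partition_path("lake/raw", "iot", date(2020, 7, 30), "events.json"),
    partition_path("lake/raw", "crm", date(2020, 8, 9), "contacts.csv"),
]

august_iot = paths_for_month(stored, "iot", 2020, 8)
```

The data itself stays in its native format. Only the location encodes enough metadata to narrow a search without opening any file.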

Technology supporting data lakes

The term data lake is often associated with Hadoop-oriented object storage. Because Hadoop runs on commodity hardware and is an open source technology, Hadoop data lakes can be set up affordably. Microsoft also offers data lakes on its Azure platform, while Amazon Web Services offers a data lake solution as an AWS CloudFormation template.

Conclusion

With the projected growth of, and interest in, data science, data lakes are gaining traction as a way to deliver new levels of information availability. We have reached the point where we need to consolidate the data silos and make large volumes of disparate data available for mining and for “predictive and prescriptive analytics”, thereby uncovering deep business value.
