The business data lake

By Masindi Mabogo, director at PBT Group.

Johannesburg, 29 Jul 2015

According to Margaret Rouse*: "A data lake is a large object-based storage repository that holds data in its native format until it is needed."

Martin Rennhackkamp** called it: "A scaled-out all-encompassing free-for-all staging area."

Put simply, a data lake is a large, easily accessible landing place that holds massive volumes of structured and unstructured data in their original form.

There are many write-ups offering different accounts of why the data lake came about. Let's gather the motivations they put forward, while steering clear of the technical jargon.

The audience

Data scientists, analysts (super/technical users) and developers were the primary intended beneficiaries of the data lake. It speaks to their need for "quick and elastic" data access without the obstacles of data warehouse (DWH) bureaucracy. It also gives them an opportunity to work with other types of data (unstructured) that previously presented challenges to the DWH ecosystem.

In recent years, the data lake has seen adoption beyond its intended audience, which challenges the technology stack to support novice and non-technical users as well.

Catering for novice users remains an area that needs improvement, particularly in the provision of user-friendly tools for mining these data lakes.

Unstructured data

The data lake is built from the ground up as a big data solution, in which unstructured data holds a valid passport to live in the data lake ecosystem.

The data lake is synonymous with big data, with its warm hospitality for both structured and unstructured data. This spares users from switching between environments to unlock the new, unified forms of business value promised by the 'holy matrimony' of structured and unstructured data.

The data lake arose in response to new types of data (video, audio, images, text files, binary, etc.) that needed to be captured and harvested for enriched corporate insights and competitive advantage.

Quick data take-on

The approach of just dumping information "as is" into the data lake sets aside the rigorous and time-consuming technical complexities ingrained in the data warehouse's DNA. This allows data to be made available for business to use timeously. The technical work does not disappear, however: it is detached from the data intake steps and moved to a later step, often called "distillation".

Distillation can be approached in cyclical iterations, as and when the data needs to be used. In this step, business users create map(s) against the data in the lake to generate the view of the data that fulfils their immediate requirements. The mapping process takes a fraction of the time because it focuses only on immediate and specific requirements. In other words, the structure and interpretation of the data are applied only when the data is used. This is called "schema on read", as opposed to the "schema on write" approach that is used in data warehousing.
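To make "schema on read" concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any particular data lake product: the sample records, the field names and the functions read_with_schema and spend_per_customer are all hypothetical. Raw records are kept exactly as they arrived, and each consumer applies its own map only at the moment of reading.

```python
import json

# Hypothetical raw "lake": records are stored as received, with no upfront schema.
RAW_LAKE = [
    '{"customer": "A001", "amount": "120.50", "channel": "web", "note": "gift"}',
    '{"customer": "A002", "amount": "80", "channel": "store"}',
    '{"customer": "A001", "amount": "19.50", "extra": {"device": "mobile"}}',
]


def read_with_schema(raw_records, schema):
    """Apply a schema at read time: keep only the mapped fields and cast them."""
    for line in raw_records:
        record = json.loads(line)
        # A field missing from a raw record becomes None instead of failing.
        yield {
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        }


# One consumer's map (hypothetical): a spend-per-customer view.
SPEND_SCHEMA = {"customer": str, "amount": float}


def spend_per_customer(raw_records):
    """Total the amount per customer using the spend map."""
    totals = {}
    for row in read_with_schema(raw_records, SPEND_SCHEMA):
        if row["customer"] is None or row["amount"] is None:
            continue  # skip records that cannot serve this view
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals


if __name__ == "__main__":
    print(spend_per_customer(RAW_LAKE))
    # A different consumer can read the same raw data with a different map.
    print(list(read_with_schema(RAW_LAKE, {"channel": str})))
```

The same raw records serve two different views without being reloaded or remodelled. In a "schema on write" approach, by contrast, the structure would have to be agreed before any record could be stored.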

The cost

All data lake write-ups make some cost-benefit argument, and they all seem to be riding the wave of plummeting storage costs. They further embrace the concept of quick turnaround time and immediate return on investment, with the benefit of starting small and scaling up as required. Others cite the cyclical approach, which allows cost to be distributed across lines of business at the time of data consumption.

All of the above holds true up to the point where data is ingested into the lake. Beyond that point, data exploration remains an area of "unknown cost", mainly because the technologies and applications used to interrogate the data are still immature.

The term data lake is increasingly accepted as a way to describe any large data pool in which the schema/structure and data requirements are not defined until the data is queried. The innovation stemmed from technical teams' thirst for quick access to explore all types of data. The data lake has earned a reputation as a cost-effective way to serve the business need for local views.

As Rennhackkamp says: "If the data lake is used correctly in the BI ecosystem, together with the data warehouse being used for what it, in turn, is good for, one can have a synergistic extended BI ecosystem that can really provide good information and insights to the business as and when needed."

* http://searchaws.techtarget.com/definition/data-lake
** http://www.martinsights.com/?p=1102
