Exploring data warehouse alternatives: Part two
While a company can change its data warehouse approach, it must avoid falling into the trap of replacing this capability and methodology with just technology.
In my previous article on this subject, I discussed the concept of alternatives to data warehouses. The key aspect that I raised in the piece is that organisations need to remember that technology is not a replacement for a data warehouse.
The importance of the data warehouse was stressed, in that it provides an organisation with a central platform for data staging, transformation and exploitation.
An organisation can certainly change its approach to designing and building its data warehouse; however, it must avoid falling into the trap of replacing the data warehouse capability and methodology with just technology.
So, what are these alternatives?
I think that in most cases, this topic should be considered not as an alternative, but rather as an extension of the core capability. One of the biggest problems historically is that we tried to make the dimensional, Kimball-style data warehouse the universal solution for all analytic requirements.
The result was that some requirements that were probably not best served by this approach were forced into a dimensional data warehouse solution.
In many cases, alternatives to the data warehouse should be considered within the context of removing some of these use cases from the data warehouse and building more appropriate, fit-for-purpose solutions.
We then unpack these use cases from the traditional data warehouse and route them through a more suitable alternative. The key goal should be to do this in such a way that we don't end up with a disjointed solution, but rather with an end-to-end solution and capability that works well together.
The data lake is the first of these alternatives that can be explored. An important part of a data warehouse strategy is the implementation of a common landing zone. The landing zone is where data is landed once, providing a common set of data that serves as a recovery point for the ETL, a representation of the source data that can be validated against, and a set of data that can be reused by multiple ETL routines.
The challenge was that we were doing this in the context of a relational data warehouse, or file system with limited processing capabilities. This approach didn’t work very well when we had to deal with semi-structured data, very large volumes of data, or rapidly changing data.
The data lake is a very good alternative to this, in that it provides a powerful, flexible environment where data of all sizes and formats can be rapidly ingested. This data can then be made directly available to data scientists for initial profiling and analysis. In many cases, this becomes the preferred source for all subsequent machine learning initiatives.
Very often, this is viewed as just a technology solution, and teams embark on a mass frenzy of data ingestion and exploitation. It is important to remember the data lake is just a technology platform.
Take a step back and consider this from a design perspective. Deliberate on how you want to store data, which storage formats are preferred, how you want to archive the data, and whether the data requires validation or transformation.
All of these are important questions to ask in order to formalise a clear and consistent design strategy for the data lake. Consider the need to land raw, unmodified source data to create the recovery point; to have a stable area for structured data that can serve as the prime source of data in the lake; to have a custom, bespoke solution zone where you can build data structures to enable specific requirements (for example, standard normalisation data sets for data scientists); and, lastly, an archive zone.
This becomes the design and methodology that will enable a professional data lake implementation. Consider also how the data lake would integrate into the existing or new data warehouse.
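To make this concrete, a zoned design can be enforced with a simple, consistent path convention. The sketch below is a minimal illustration, assuming hypothetical zone names (raw, stable, curated, archive) and a hypothetical `/lake` root; the article does not prescribe these names, they stand in for whatever zones your design strategy defines.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical zone names: one illustrative convention, not a standard.
ZONES = ("raw", "stable", "curated", "archive")

@dataclass(frozen=True)
class Dataset:
    source: str      # originating system, e.g. "crm"
    name: str        # logical data set name, e.g. "customers"
    load_date: date  # ingestion date, used for partitioning

def zone_path(ds: Dataset, zone: str) -> str:
    """Build a consistent storage path for a data set in a given zone."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return (f"/lake/{zone}/{ds.source}/{ds.name}/"
            f"load_date={ds.load_date.isoformat()}")

ds = Dataset("crm", "customers", date(2024, 1, 15))
print(zone_path(ds, "raw"))
# /lake/raw/crm/customers/load_date=2024-01-15
```

The point of the convention is that every ingestion routine lands data the same way, so the raw zone reliably serves as the recovery point and the stable zone as the prime source, rather than each team inventing its own layout.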
Another example of a data warehouse alternative is the operational data store (ODS). There are many use cases that depend on automated, batch data integration to enable business processes. These use cases were often pushed into the data warehouse, but due to their operational nature, they would be better served by a relational, operational implementation.
This introduces the concept of an ODS, which acts as a storage repository for operational business data. The ODS is typically less focused on historical data and more on current or near current data.
The aim is to model the data efficiently, which is why a relational, third normal form approach is often chosen. This model can be used to drive a variety of initiatives and to facilitate operational management: decision-making that affects the current management and operations of the business, as opposed to the strategic view of the data warehouse.
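The third normal form idea can be sketched in a few lines: each entity is stored once and related by keys, rather than flattened into the denormalised dimensions of a warehouse star schema. The entity and attribute names below are purely illustrative assumptions.

```python
from dataclasses import dataclass

# An illustrative third-normal-form fragment for an ODS: entities stored
# once, related by keys, with no repeated descriptive data.
@dataclass(frozen=True)
class Customer:
    customer_id: int
    name: str

@dataclass(frozen=True)
class Order:
    order_id: int
    customer_id: int  # foreign key: customer details are not repeated here
    amount: float

customers = {c.customer_id: c for c in [Customer(1, "Acme")]}
orders = [Order(100, 1, 250.0), Order(101, 1, 99.5)]

# Resolving an order's customer is a key lookup, not duplicated data.
print(customers[orders[0].customer_id].name)  # Acme
```

Because current data changes frequently, this normalised shape keeps updates cheap: changing a customer's name touches one row, which suits the ODS's focus on current rather than historical data.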
The ODS can be implemented using a variety of technologies and platforms. The important question to ask here is how it can be implemented to deliver on the requirements, while also complementing and enabling the data warehouse.
We can build an ODS to great effect, in such a way that it enables these requirements. Going further, however, nothing that we do should result in redundant processing.
So, if we are performing ETL in the ODS, how can we leverage this for the data warehouse, so that the data warehouse can reuse and benefit from any processing done in the ODS? Anything that is built in the ODS should be considered a formal source for the data warehouse.
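The reuse principle above can be sketched as follows: cleansed data is published to the ODS once, and warehouse staging reads from the ODS rather than repeating the extraction and transformation against the operational systems. The function and table names are illustrative assumptions, not a real framework.

```python
# A minimal sketch of "ODS as a formal source" for the data warehouse.
# The catalogue stands in for the ODS's persisted, already-processed tables.
ods_catalog: dict = {}

def publish_to_ods(table: str, rows: list) -> None:
    """Simulate an ODS load: store cleansed, integrated rows once."""
    entry = ods_catalog.setdefault(table, {"rows": [], "loads": 0})
    entry["rows"] = rows
    entry["loads"] += 1

def stage_for_warehouse(table: str) -> list:
    """Warehouse staging reads the ODS output instead of re-running ETL."""
    if table not in ods_catalog:
        raise KeyError(f"{table} has not been published to the ODS")
    return ods_catalog[table]["rows"]

publish_to_ods("customer", [{"id": 1, "name": "Acme"}])
staged = stage_for_warehouse("customer")  # reuse: no second extraction
```

The design point is that the expensive work (extraction, cleansing, integration) happens once, in the ODS; the warehouse consumes its output, so nothing is processed redundantly.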
There are more potential alternatives, such as the data lakehouse and the data vault. I will discuss these in my next article.
The pattern that should be starting to emerge, however, is that the data warehouse is not going away. We might be shrinking its size and scope, extending its capability with additional technology and designs, but we fundamentally will always retain the concept of the data warehouse.
Even when we discuss how we can change the fundamental design of the data warehouse (for example, by adopting a data vault design pattern), we are just changing the format of the data warehouse, not removing it from circulation.
Principal consultant at PBT Group
Julian Thomas is principal consultant at PBT Group, specialising in delivering solutions in data warehousing, business intelligence, master data management and data quality control. In addition, he assists clients in defining strategies for the implementation of business intelligence competency centres, and implementation roadmaps for a wide range of information management solutions.
Thomas has spent most of his career as a consultant in South Africa, and has implemented information management solutions across the continent, using a wide range of technologies. His experience in the industry has convinced him of the importance of hybrid disciplines, in both solution delivery and development. In addition, he has learned the value of robust and flexible ETL frameworks, and has successfully built and implemented complementary frameworks across multiple technologies.