This seems to be a strangely South African phenomenon. In the past six months, we have worked with no fewer than three large organisations that insist on having hybrid data warehouse architectures.
Internationally, some organisations use hybrid data models in a single data warehouse, a hybrid data warehouse/data mart combination, or a hybrid of top-down and bottom-up development methodologies, but very few use a hybrid data warehouse architecture as we've described it here.
In essence, a hybrid data warehouse architecture consists of two "pipelined" enterprise data warehouses: the first an Inmon-style normalised relational database, the second a Kimball-style dimensional data warehouse. The goal is to provide enterprise-wide data management together with enterprise-wide information integration and exploitation.
Both data warehouses store data in its most granular, atomic and time-variant form. The two may, but need not, be rolled into a single data warehouse database, and may, but need not, be managed by the same DBMS. The "rule" of the architecture is that all data must reside in the normalised part, while over time most of it will also end up in the dimensional part, as and when it is needed by decision-makers, knowledge workers and other information consumers.
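The data flow described above can be sketched in a few lines of Python. This is a toy illustration only - real implementations live in a DBMS with proper ETL tooling - and all table and column names below are our own invented examples:

```python
# Sketch of the two-stage "pipelined" hybrid architecture:
# source -> normalised EDW (Inmon-style) -> dimensional DW (Kimball-style).
# All names are illustrative, not from any real system.

from datetime import date

# Stage 1: land source data, at its most atomic grain, in normalised form.
source_rows = [
    {"cust_id": 1, "name": "Acme", "city": "Cape Town", "amount": 100.0},
    {"cust_id": 1, "name": "Acme", "city": "Cape Town", "amount": 250.0},
]

normalised_customer = {}   # cust_id -> descriptive attributes (3NF-style)
normalised_sales = []      # transaction-grain facts, kept atomic

for row in source_rows:
    normalised_customer[row["cust_id"]] = {
        "name": row["name"], "city": row["city"], "loaded": date.today()}
    normalised_sales.append(
        {"cust_id": row["cust_id"], "amount": row["amount"]})

# Stage 2: a second ETL pass publishes the normalised layer into a star schema.
customer_dim = [
    {"customer_key": i + 1, "cust_id": cid, **attrs}
    for i, (cid, attrs) in enumerate(sorted(normalised_customer.items()))]
key_lookup = {d["cust_id"]: d["customer_key"] for d in customer_dim}
sales_fact = [
    {"customer_key": key_lookup[s["cust_id"]], "amount": s["amount"]}
    for s in normalised_sales]
```

Note that the same atomic detail is extracted, transformed and stored twice - which is exactly the overhead discussed below.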
Some of these organisations wanted to perform enterprise-wide integration in the normalised part itself, while others were wise enough to realise that this would be far too complex and time-consuming. Data integration in the normalised part implies drawing up a normalised enterprise data model, which has historically been one of the IT initiatives with the highest failure rates: it is just too complex to get right, and too dynamic to keep right.
Merits
With double the storage overhead and double the ETL processing required, does the hybrid architecture have any merits? In our opinion, yes, to an extent - especially in more federated organisations - but these merits are very hard to justify, and their returns are equally hard to measure quantitatively.
The normalised data warehouse creates a data store that reflects the content of the source systems, yet is independent of them. This is very useful for data management activities such as data quality, audit and archiving.
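As a toy illustration of the kind of data-quality auditing such a store enables, one could run simple rule checks over the normalised layer without touching any source system. The rules and table shapes below are hypothetical:

```python
# Hypothetical data-quality audit over the normalised layer.
# Rule names and table shapes are illustrative only.

normalised_customer = [
    {"cust_id": 1, "name": "Acme", "city": "Cape Town"},
    {"cust_id": 2, "name": "", "city": None},
]

rules = {
    "name_not_blank": lambda r: bool(r["name"].strip()),
    "city_present": lambda r: r["city"] is not None,
}

def audit(rows, rules):
    """Return {rule_name: [offending cust_ids]} for later remediation."""
    failures = {name: [] for name in rules}
    for row in rows:
        for name, check in rules.items():
            if not check(row):
                failures[name].append(row["cust_id"])
    return failures
```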
The normalised data warehouse also forms a stable and consistent point of reference for the dimensional data warehouse. Provided that the normalised data warehouse is kept up-to-date and correct, it forms a very convenient base layer for expedient information delivery through the dimensional data warehouse. The dimensional data warehouse builders never have to be concerned with the technical or political problems of getting the required data timeously from the various source systems.
An interesting, but totally unmeasurable, merit has to do with ownership. The normalised data warehouse can be seen as IT's "property" and is driven on a much more technical basis, while the dimensional data warehouse is much more the home of business-driven and business-focused analytical activities. In this arrangement, IT carries the full responsibility for populating the normalised data warehouse correctly and completely, while the business has to drive IT to populate the dimensional data warehouse.
Challenges
However, in addition to the obvious storage and processing overheads, the hybrid architecture faces many serious challenges. We will discuss only two of them here.
Getting standardised 3NF data models in place for all the source systems is a daunting and very time-consuming task, never mind populating them with quality data on time. Adding time variance to the normalised models is even more complex, as normalisation plays havoc with changing descriptive data.
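To see why, consider interval versioning, one common way to add time variance to a normalised table: every change to a descriptive attribute must close the current validity interval and open a new one, and every join becomes a point-in-time lookup. The sketch below uses invented names and plain Python in place of SQL:

```python
# Illustrative interval versioning of a normalised descriptive attribute.
# Table and column names are hypothetical; effective_to=None means "current".

from datetime import date

customer_address_history = [
    {"cust_id": 7, "city": "Durban",
     "effective_from": date(2005, 1, 1), "effective_to": None},
]

def record_change(history, cust_id, new_city, as_of):
    """Close the open interval and open a new one for a changed attribute."""
    for row in history:
        if row["cust_id"] == cust_id and row["effective_to"] is None:
            if row["city"] == new_city:
                return  # no real change; nothing to version
            row["effective_to"] = as_of
    history.append({"cust_id": cust_id, "city": new_city,
                    "effective_from": as_of, "effective_to": None})

record_change(customer_address_history, 7, "Pretoria", date(2006, 6, 1))

def city_as_of(history, cust_id, when):
    """Point-in-time lookup: find the interval covering the requested date."""
    for row in history:
        if (row["cust_id"] == cust_id
                and row["effective_from"] <= when
                and (row["effective_to"] is None
                     or when < row["effective_to"])):
            return row["city"]
```

Multiply this bookkeeping across every descriptive table in a 3NF enterprise model, and the complexity becomes apparent.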
One of the goals of the dimensional data warehouse was the expedient, piecemeal delivery of information to the users - Ralph Kimball advocates developing these pieces per business process or business functional area. Putting an enterprise-wide data management layer in place first goes directly against this goal. In order to have enterprise-wide integration, the whole normalised layer has to be in place first: you cannot have a single view of the customer in the dimensional part if each and every component of that view is not available in the normalised part. As Ralph Kimball asks: if you appreciate the premise that you must have atomic integrated data in the data warehouse to avoid pre-supposing any questions, then why would you want to ETL, store and maintain the atomic details twice? Isn't the value proposition more compelling if you focus the investment in resources and technology on appropriately publishing additional key performance metrics for the business?
Summary
While the hybrid architecture creates a data management platform that eases some of the IT department's data management burdens, the crucial question will always be: is there a direct business benefit? At the same time, the hybrid architecture creates a very stable, apolitical layer from which to build the dimensional data warehouse according to business requirements. In highly federated organisations, it may be the only practical way to bridge the chasm between IT and the business. I mean, there must be a reason why our forefathers used oxen to drag huge wagons across the country instead of faster horses pulling smaller, nimbler carts.

