Balancing relevance and correctness

By Martin Rennhackkamp for PBT Group

Johannesburg, 03 Nov 2003

In the business intelligence world, the current drive is to increase the relevance of information. Relevance is defined by three independent axes, namely timeliness, breadth and depth. Timeliness relates to getting the most immediate information available. Depth refers to getting to the most granular level of detail. Breadth addresses the ability to get consistent and integrated information from the widest variety of data sources (within and external to the organisation). Thus, the more we can raise the bar on each of these axes, the more relevant our decision-making information is, and the more value we will get in return from it.

Vendors are pushing toolsets and architectures to increase relevance. Timeliness is being addressed through real-time data warehousing approaches and EAI (enterprise application integration) toolsets. Depth is addressed by loading more detailed operational information. Breadth is a very interesting case in point. Here we are not only talking of integrating the multitude of structured data sources within and external to the organisation, but also a myriad of sources of unstructured data.

It is with the breadth issue that tool vendors and warehouse architects are battling. At the outset, data warehousing and information exploitation were primarily concerned with structured data - data extracted from traditional record-keeping operational systems that is neatly and effectively structured in tables with columns. However, as the need evolves to deal with the customer and supplier (and even the employee) in a much more direct, personal and therefore informed manner, the requirement has evolved to also have access to highly unstructured data. Examples include scanned images of contracts, replays of recorded conversations, presentations, electronic documents and e-mail messages.

For example, at The Data Warehousing Institute (TDWI) summer conference, held in Boston in August 2003, Dr Barry Devlin of IBM proposed an extended architectural framework consisting of replicators, ETL tools, information integration tools, remote access tools and a host of other components that would cater for both structured and unstructured data, without the need to pump the unstructured data through the extensive ETL "plumbing".

However, it is at this point that we need to step back - and relearn a lesson from the past. If we view a data warehouse from the premise that it is just a specialised database, there have been a number of significant developments that have substantially influenced databases, and subsequently data warehousing, over the last few decades. These include the ANSI/X3/SPARC three-tiered DBMS architecture, the relational data model proposed by the late Dr EF Codd and, more recently, the dimensional data model developed and advocated by Ralph Kimball.

However, all three of these cornerstone developments have one simple aspect in common - one can only achieve their intended benefits if they are implemented correctly. For example, most of the relational DBMSs available since the inception of the relational model do not strictly and correctly enforce the use of the theoretical model as drawn up by Dr Codd. It is therefore possible to implement non-relational database schemas, with high degrees of redundancy and data duplication, which in turn can lead to highly incorrect and inconsistent information. The crux therefore lies in both the model (or architecture) and its correct implementation. In the case of the relational model, strict and correct enforcement of the various integrity constraints ensures that the characteristics of the model are correctly adhered to.
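As a small illustration of the point about integrity constraints, the sketch below uses Python's built-in sqlite3 module. The customer and sales_order tables, and the helper names build_schema and try_orphan_order, are hypothetical and chosen only for this example. SQLite happens to make the argument visible because it only enforces foreign keys when the foreign_keys pragma is switched on: the same schema either accepts or rejects an inconsistent row, depending purely on whether the constraint is actually enforced.

```python
import sqlite3


def build_schema(enforce: bool) -> sqlite3.Connection:
    """Create a hypothetical two-table schema, optionally enforcing foreign keys."""
    conn = sqlite3.connect(":memory:")
    # SQLite checks foreign keys only when this pragma is ON for the connection.
    conn.execute(f"PRAGMA foreign_keys = {'ON' if enforce else 'OFF'}")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE sales_order ("
        " id INTEGER PRIMARY KEY,"
        " customer_id INTEGER NOT NULL REFERENCES customer(id),"
        " amount REAL NOT NULL CHECK (amount >= 0))"
    )
    conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Acme')")
    return conn


def try_orphan_order(conn: sqlite3.Connection) -> bool:
    """Return True if an order for a non-existent customer was accepted."""
    try:
        conn.execute("INSERT INTO sales_order (customer_id, amount) VALUES (999, 50.0)")
        return True
    except sqlite3.IntegrityError:
        # The referential integrity constraint rejected the inconsistent row.
        return False


if __name__ == "__main__":
    print("accepted without enforcement:", try_orphan_order(build_schema(enforce=False)))
    print("accepted with enforcement:   ", try_orphan_order(build_schema(enforce=True)))
```

The schema is identical in both runs; only the enforcement differs, which is the distinction between having a sound model and implementing it correctly.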

There are many reasons why the de facto data warehouse architecture is appropriate. Initially, it was created to improve information retrieval performance. It reduces risk and increases availability - if either the operational systems or the data warehouse becomes unavailable, users of the other component can continue their day-to-day tasks autonomously. The architecture facilitates the creation of an integrated informational view of the entire enterprise. For the first time, data from diverse and seemingly unrelated systems is integrated to form a coherent source of information. The data warehouse architecture allows independent growth and scalability. In organisations where data is recognised and managed as a major corporate resource, the modular architecture facilitates better and more controlled information management.

With the data warehouse becoming the sole source for management-level information, it is very important that the data contained therein and the information provided therefrom be of the highest possible level of quality - in terms of accuracy and correctness, after relevance and timeliness. The de facto data warehouse architecture has a very simple, very well controlled source-to-warehouse data flow, along which aspects such as data quality control are supposed to be embedded. The best part of the architecture in which data quality problems can be analysed, detected and corrected is the ETL processes, where the data is already going through an analysis and processing cycle. The staging area, as a back-end "kitchen" area, is the ideal place for data quality control processes - where the data is prepared to a high standard of quality and correctness before it is served to the users in the front-end "restaurant" area.
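The "kitchen" idea can be sketched as a simple quality gate in the staging area. This is a minimal illustration under assumed rules, not a description of any particular ETL tool: the field names customer_id and amount, the two checks, and the names check_row, run_quality_gate and StagingResult are all hypothetical. Rows that pass every check move on towards the warehouse; rows that fail are held back together with the reasons, so nothing reaches the users unchecked.

```python
from dataclasses import dataclass, field


@dataclass
class StagingResult:
    clean: list = field(default_factory=list)     # rows fit to load into the warehouse
    rejected: list = field(default_factory=list)  # (row, problems) pairs held back


def check_row(row: dict) -> list:
    """Return the list of quality problems found in one staged row."""
    problems = []
    if not row.get("customer_id"):
        problems.append("missing customer_id")
    amount = row.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        problems.append("amount must be a non-negative number")
    return problems


def run_quality_gate(staged_rows: list) -> StagingResult:
    """Split staged rows into clean rows and rejected rows with their problems."""
    result = StagingResult()
    for row in staged_rows:
        problems = check_row(row)
        if problems:
            result.rejected.append((row, problems))
        else:
            result.clean.append(row)
    return result


if __name__ == "__main__":
    staged = [
        {"customer_id": "C1", "amount": 10.0},
        {"customer_id": "", "amount": 5},
        {"customer_id": "C2", "amount": -1},
    ]
    outcome = run_quality_gate(staged)
    print("rows to load:", outcome.clean)
    for row, problems in outcome.rejected:
        print("held back:", row, problems)
```

Keeping the rejected rows and their reasons, rather than silently dropping them, is a design choice made for this sketch: it lets the source-system owners see and correct what was held back.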

The problem with Devlin's proposed extended architecture is that it violates the basic "data flow" principles of the de facto data warehouse architecture. In his proposed extended architecture, data flows in many directions and through a multitude of facilities - and sometimes it is even directly accessible. The data can therefore, in theory, end up uncontrolled and unchecked in the data warehouse - immediately available to many users. This is analogous to having the suppliers dump food right on the restaurant customers' plates, without first checking its quality and pre-processing it in the kitchen.

The crux of the matter is that if we are going to short-circuit architectures and quality control procedures to try to improve relevance, we will most definitely risk compromising data quality and correctness. At the end of the day, we can have the most detailed data, from all the possible sources, available immediately, but if it isn't correct, it is worth nothing. The challenge is to increase relevance without compromising quality and correctness.
