Data quality from a business intelligence perspective

By Martin Rennhackkamp for PBT Group

Johannesburg, 16 Mar 2004

In the days before data warehouses and business intelligence, the business unit responsible for each operational system also had to provide information about that part of the business to management, and to the rest of the organisation.

The IT team who implemented and maintained the operational system also coded and implemented the reports that extracted and presented this information - under direct specification from the business unit.

In those days there were lots of data quality problems. With the high volume of backroom data capturing and the clumsy systems used, there were probably more data quality problems than today. The difference was, the business could specify directly what to do with the bad quality data - how to hide it or work around it to deliver seemingly good quality information to the business. Many a COBOL report program contained quick-fixes to "improve" or rather hide data quality on-the-fly.

Since the advent of data warehouses, data marts and business intelligence, and also since some organisations have become much more federated, the landscape has changed considerably. For one, the BI team is now responsible for information delivery to business users who are often in a different part of the business from where the data originates. For example, in the retail world, data originating from point-of-sale systems are used in reports providing information to stock acquisition, store planning, distribution and marketing business units. With empowered end-users and knowledge workers placed in the business units, the information consumers often implement their own reports and analyses using the data the BI team makes available to them.

The crucial difference here is that the BI team does not have the authority to adjust bad quality data or fabricate kludges to work around it. They have to deal with it. They have to ensure that the data warehouse content can be reconciled back to the source systems. The problem is often that the end-users of the information, being far removed from the source system owners, view the BI team as the actual providers of bad quality data. With the information delivery a step away from the source systems, the perception is that the BI team has introduced or is directly responsible for the bad quality data, whereas they have merely been the conduit relaying it from the originating source to the end-users. They are not at liberty to hide it, as the operational business units with their "own" (or directly responsible) IT functions were able to do in the olden days.
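As an illustration of what reconciling warehouse content back to a source system can involve, the following is a minimal sketch in Python. It is not taken from any particular tool: the function names, the row layout (plain dictionaries) and the choice of comparing row counts and amount totals per key are all assumptions made for the example. Note that it only reports differences - it does not adjust any data.

```python
from collections import Counter


def summarise(rows, key, amount):
    """Return (row counts, amount totals) per key value for a list of dict rows."""
    counts, totals = Counter(), Counter()
    for row in rows:
        counts[row[key]] += 1
        totals[row[key]] += row[amount]
    return counts, totals


def reconcile(source_rows, warehouse_rows, key, amount):
    """List every key whose row count or amount total differs between the
    source extract and the warehouse. Hypothetical helper: it reports
    discrepancies only and never changes the data."""
    src_counts, src_totals = summarise(source_rows, key, amount)
    dwh_counts, dwh_totals = summarise(warehouse_rows, key, amount)

    discrepancies = []
    # A Counter returns 0 for missing keys, so keys present on only one
    # side show up as a mismatch as well.
    for k in sorted(set(src_counts) | set(dwh_counts)):
        if src_counts[k] != dwh_counts[k] or src_totals[k] != dwh_totals[k]:
            discrepancies.append({
                "key": k,
                "source_count": src_counts[k],
                "warehouse_count": dwh_counts[k],
                "source_total": src_totals[k],
                "warehouse_total": dwh_totals[k],
            })
    return discrepancies
```

Amounts are assumed to be whole numbers (for example cents) so that totals can be compared exactly. An empty result means the two sides agree at this level of detail; any entries returned are candidates for investigation with the source system owner.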

The remainder of this article explores two fundamental principles related to data quality and the corresponding approaches that should be put in place.

Close the loop

On most data warehouse architecture diagrams, the flow of data is depicted as a one-way flow from source systems, through an ETL process and a staging area, into the data warehouse and from there optionally through data marts to the information exploitation components such as reports, OLAP cubes, data mining marts, analytical and other down-stream applications.

On many of the diagrams used in "data warehouse methodology 101" type courses, the only feedback loop typically indicated is the one that starts from the sub-project delivery completion back to the start of the next sub-project, which is used to indicate that BI is a programme consisting of inter-related sub-projects, and not an end-to-end project (as is found in the development and deployment of most other IT capabilities).

The problem with both of these is that in an attempt to present an easy-to-comprehend view of architecture and of BI systems development, a crucial concept such as data quality is left out as "detail to be specified later". It is analogous to having an architecture drawing and a project plan for a massive hotel, without the plumbing included... you can imagine the effect on the appearance and functioning of the hotel if all the plumbing components were added as "detail to be specified later".

(This analogy is not that far-fetched - information consumers have similar rights to hotel guests and they are often just as problematic to deal with!)

The crucial missing link is that the data quality feedback loops - and there are many - should be incorporated into the architecture and the methodology from the beginning. The approach of hiding the detail, to be specified at a later stage (which is also applied to other aspects, such as disaster recovery, data availability and especially privacy and security), is extremely short-sighted; it always comes back to haunt you!
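One such feedback loop can be sketched as follows. This is only an illustration under stated assumptions: the record layout, the `Issue` type, the rule names and both functions are hypothetical and do not come from any specific product or methodology. The idea is that records are checked against simple rules as they pass through staging, and the resulting issues are grouped by originating source system so that they can be reported back to the owners, rather than being fixed or hidden by the BI team.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    """One failed data quality rule for one record (hypothetical structure)."""
    source_system: str
    record_id: str
    rule: str


def check_records(records, rules):
    """Apply every rule to every record and collect the failures.

    `records` is a list of dicts with at least "id" and "source_system";
    `rules` maps a rule name to a function returning True when the record
    passes. The records themselves are left unchanged.
    """
    issues = []
    for record in records:
        for rule_name, passes in rules.items():
            if not passes(record):
                issues.append(
                    Issue(record["source_system"], record["id"], rule_name)
                )
    return issues


def feedback_by_source(issues):
    """Group issues by source system, ready to be sent back to each owner."""
    grouped = defaultdict(list)
    for issue in issues:
        grouped[issue.source_system].append(issue)
    return dict(grouped)


# Example rules, assumed for illustration only.
EXAMPLE_RULES = {
    "store_present": lambda r: bool(r.get("store")),
    "quantity_non_negative": lambda r: r.get("quantity", 0) >= 0,
}
```

The design choice worth noting is that the loop ends at the source system owner: the output is a list of issues per source, not a corrected data set.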

Organisational issues

The business models used in many organisations are also not always conducive to good data quality practices. Running the BI team as a business which charges out its services forces them to often leave certain details such as data quality "to be specified and implemented later", in an attempt to reduce the initial project scope and budget. Especially in large corporate organisations where there are quite a lot of competing BI initiatives, the enterprise BI team often has to "sharpen their pencils" and work at break-neck speed to get the business units to use the enterprise data warehouse rather than develop their own satellite BI capabilities.

In the rush to first catch the big fish and deal with it afterwards, crucial aspects such as data quality and security are left for later phases... which invariably either never happen, or happen too late, when a perception of bad quality has already been formed. The result is like landing a large poisonous fish in the boat: with its poisoned spikes and thrashing around, it does more damage than a smaller, more edible fish would have done, even though the smaller fish may have taken longer to land. The BI function should be accountable for what it does, especially as BI is a high-cost function - but using a charge-out mechanism to manage it introduces so many bad practices that it is definitely counter-productive.

The other organisational aspect which affects data quality is what we can call the data quality authority communication gap. In a typical organisation's BI space there are three groups of role-players: the producers (source system owners and implementers), the facilitators (BI team) and the consumers (the end-users, often from other business units than the producers). When it comes to data quality problems, we can almost turn to legal terms... the consumers (complainants) lodge complaints with the facilitators about the quality of the data received from the producers (the defendants). However, in this case, we cannot equate the facilitators with a legal agency. The BI team does not have the authority to cast any verdict, never mind effect any changes, on the source systems or their owners to improve data quality. They can only facilitate the communication if they are allowed to - the real change has to happen at the producers. The problem here is that the source system owners hardly ever want to accept the responsibility for data quality improvement. They have been running for years without paying any additional attention to it, and in addition, they probably have their own pressing needs to take care of to keep their own units running profitably. Thus, unless they get ordered to do so, the source system owners are not going to pay any attention to data quality.

The compounding aspect of this problem is that in the typical corporate organisation, the executives who can issue the necessary data improvement ordinance more often than not do not have the slightest inclination to do so, or to get involved in the issue (it is too low-level a detail to bother about), regardless of the potential cost or loss that a bad decision based on bad quality data can cause.

This is where the business sponsor has a crucial role to play. If he cannot influence the source system owners to pay attention to data quality, he has to persistently escalate the issue to a level in the organisation from where such a "thou shalt fix the data quality" command will be issued.

Summary

Data quality is too important to be ignored. Fixing data quality problems downstream is prohibitively expensive, never mind the fact that the fix may have to be repeated for every place where the information is used. The potential cost (or business loss) of making an incorrect business decision based on incorrect or bad quality data makes an even more compelling argument to pay attention to data quality.

This article outlined two important, often-overlooked aspects - that data quality must be factored into the BI architecture and into the BI development methodology. It also discussed the organisational aspects that may be counter-productive to data quality initiatives, along with approaches for addressing these issues.

There are many data quality improvement methodologies that can be super-imposed on most BI development and implementation methodologies. There are also many data quality monitoring and improvement tools available on the market. There are therefore no methodological, technical or organisational excuses why decision-makers should have to live with bad quality decision-making information as a result of any BI initiative using bad quality source system data.
