How did we get into such a mess?

It's time to come clean about pervasive and insidious "dirty data".
By Julian Field, MD of CenterField Software
Johannesburg, 23 Jul 2002

The problem of "dirty data" is now getting such mind share, here and across the world, that it's probably not necessary to keep on beating the same drum over and over. So I thought I would do some research on just how we got to the point that the platforms on which our businesses depend got into such a shambolic state.

(And it truly is shambolic. Scan the Web, visit a few data warehousing portals, check out sites such as DMReview and TDWI.com, and see the mind share dirty data is enjoying!)

I can do no better in interpreting the situation as it stands than to defer to a brilliant piece of research, presented by Larissa Moss, president of Method Focus, a company which specialises in improving the quality of business information systems.

Moss acknowledges in an article published on DMReview that the problem of dirty data is both pervasive and insidious. Then she puts forward the question many people are asking: how did it happen in the first place?

* Firstly, there are "inadequate and sometimes non-existent edit checks of operational systems".

* Secondly, companies have sloppy data entry habits. Much dirty data is generated inadvertently and not caught by edit checks, but just as much is purposefully entered!

Imagine that: companies are deliberately entering data which compromises systems down the line. How can this happen?

There are myriad reasons. The first is dummy values, where bogus data is entered simply to populate a mandatory field: the data capture clerk is not sure of a person's postal code or ID number, so they enter 9999 or 999-99-9999, just to clear the person's record. This is easy to trap, but what if the data capture clerk enters their own postal code or ID number instead?
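
To make this concrete, here is a small Python sketch of my own (the placeholder list, field values and threshold are illustrative assumptions, not anything from Moss's research). Obvious placeholders can be matched against a blacklist; a clerk's own postal code, which no blacklist will catch, betrays itself as a single value repeated far more often than chance allows:

    from collections import Counter

    # Illustrative placeholder list; a real blacklist would be built per field.
    KNOWN_DUMMIES = {"9999", "999-99-9999", "00000", "N/A", "UNKNOWN"}

    def suspect_values(values, max_share=0.02):
        # Flag known placeholders, plus any single value that accounts for
        # an implausibly large share of a free-entry field -- the signature
        # of a clerk typing their own postal code into every record.
        counts = Counter(values)
        flagged = {v for v in counts if v in KNOWN_DUMMIES}
        for value, n in counts.items():
            if n / len(values) > max_share:
                flagged.add(value)
        return flagged

    codes = ["2196", "9999", "2001", "2001", "2001", "2001", "8001"]
    print(suspect_values(codes, max_share=0.25))   # flags '9999' and '2001'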

A second cause is the absence of data. Different business units have different requirements for data values to run their business: a field one unit depends on may be irrelevant to the unit actually capturing the record, and so is simply left empty. This reflects the fact that operational systems exist to facilitate the smooth running of the business, not to provide information for its strategic direction.
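
Sizing that gap usually starts with a simple profiling pass. Here is a minimal sketch, with an invented record layout, that counts how often each field is actually populated:

    def missing_rates(records, fields):
        # Share of records, per field, with no usable value; None and
        # empty/whitespace strings both count as absent.
        rates = {}
        for f in fields:
            absent = sum(1 for r in records
                         if r.get(f) is None or str(r[f]).strip() == "")
            rates[f] = absent / len(records)
        return rates

    customers = [
        {"name": "A. Smith", "postal_code": "2196", "birth_date": ""},
        {"name": "B. Nkosi", "postal_code": "",     "birth_date": None},
    ]
    print(missing_rates(customers, ["name", "postal_code", "birth_date"]))
    # {'name': 0.0, 'postal_code': 0.5, 'birth_date': 1.0}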

For anyone looking to develop a data warehouse, the third challenge is even more fun. It is the issue of multi-purpose fields, shared by various departments in the same company, but for different purposes. Much of the data in these fields will be shared, but typically 20% of the fields are non-common, and it is here that the skills of IT staff are stretched. The same data value in a field can and does end up meaning many different things, depending on which department entered it and on specific values in one or several other data fields. Data groups and entire records are redefined over and over again, with no documentation left to trace the history of the redefinitions.
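
For a flavour of what decoding such a field involves, here is a short sketch; every code and column name in it is invented, since the real mappings typically live only in departmental folklore:

    # Hypothetical decode table for an overloaded field: the meaning of
    # AMOUNT_2 depends on which department's code sits in REC_TYPE.
    AMOUNT_2_MEANING = {
        "SAL": "credit_limit",         # sales records
        "FIN": "outstanding_balance",  # finance reused the same column
        "WHS": "pallet_count",         # the warehouse reused it again
    }

    def decode_amount_2(record):
        meaning = AMOUNT_2_MEANING.get(record["REC_TYPE"])
        if meaning is None:
            # The undocumented redefinitions Moss describes end up here.
            raise ValueError("undocumented REC_TYPE: %r" % record["REC_TYPE"])
        return meaning, record["AMOUNT_2"]

    print(decode_amount_2({"REC_TYPE": "FIN", "AMOUNT_2": 1520.00}))
    # ('outstanding_balance', 1520.0)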

Overloading on this scale leads to what Moss refers to as "kitchen sink" records, and inevitably to a corporate initiative to clean the data.

Moss paints an increasingly gloomy picture. For anyone tasked with getting clean data out of the system, the prospects do not look good, especially as we now find cryptic data, or kitchen sink fields, in our kitchen sink records.

Fields used for many purposes lose meaning over time and become cryptic beyond recognition. This is particularly true in systems which have been operational for two to three decades, and where the people who assigned the initial meaning and values to fields have long since gone. Now let's get down to addresses. You're likely to find plenty of address data which contradicts itself: incorrect postal codes for cities, for instance. Then there is the gem (highly common!) where an address line is used inappropriately to capture personal and other data. Have fun parsing that!
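
To make the parsing pain concrete, here is a small sketch of the kind of cross-checks involved; the city-to-postal-prefix lookup and the 13-digit ID pattern are illustrative assumptions (South African ID numbers happen to be 13 digits), not a production rule set:

    import re

    # Illustrative lookup only; real postal data would come from a
    # reference file, not a hard-coded dictionary.
    CITY_POSTAL_PREFIXES = {
        "JOHANNESBURG": ("1", "2"),
        "CAPE TOWN": ("7", "8"),
    }

    ID_LIKE = re.compile(r"\b\d{13}\b")  # a 13-digit run, like an SA ID number

    def address_problems(city, postal_code, address_line):
        problems = []
        prefixes = CITY_POSTAL_PREFIXES.get(city.upper())
        if prefixes and not postal_code.startswith(prefixes):
            problems.append("postal code contradicts city")
        if ID_LIKE.search(address_line):
            problems.append("address line appears to hold an ID number")
        return problems

    print(address_problems("Cape Town", "2196", "c/o 8001015009087 J Smith"))
    # ['postal code contradicts city', 'address line appears to hold an ID number']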

If you think Moss has uncovered a hornet's nest of dirty data issues, try these for size:

* Violation of business rules;

* Reused primary keys;

* Non-unique or multiple disparate identifiers; and

* Data integration problems due to dirty data.

There are two twists to this dilemma: data that should be related but cannot be, and data that is inadvertently related but should not be. They both have a profound impact on the business.
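
By way of illustration, here is a sketch (with an invented customer layout) of how a reused key might be caught before integration; any key value that turns up against more than one distinct identity is a candidate for relating data that should never have been related:

    def reused_keys(rows, key_field, identity_fields):
        # Report key values that appear against more than one distinct
        # identity -- the signature of a reused or non-unique key that
        # will relate records which should never have been related.
        seen = {}
        bad = set()
        for row in rows:
            identity = tuple(row[f] for f in identity_fields)
            if seen.setdefault(row[key_field], identity) != identity:
                bad.add(row[key_field])
        return bad

    customers = [
        {"cust_no": "C100", "name": "J. Smith",   "id_number": "7201015800084"},
        {"cust_no": "C100", "name": "P. Dlamini", "id_number": "8505055800081"},
        {"cust_no": "C101", "name": "A. Naidoo",  "id_number": "6303035800086"},
    ]
    print(reused_keys(customers, "cust_no", ["name", "id_number"]))   # {'C100'}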

Moss's solution is multifaceted, and involves the close collaboration of users and IT in determining the scale of the problem and the appropriate actions: do we dump our existing systems, do we apply packages, do we run tools to try and fix the problems, and where do we begin in terms of priorities?

Sobering stuff!
