When it comes to business intelligence, there are many analogies between data quality and the cleanliness of your car`s windscreen. For one, it can have a serious effect on your view of the (business) world. It can effect split-second decisions you may have to take - sometimes or especially in bad light... Now, you can drive for days on end with a dirty windscreen, but comes a time, you are going to have to pull over and clean it so that you can see what is actually going on (ahead).
Windscreen attendants
Data quality cleansing is often done during the ETL process. Sticking with our analogy - having the attendant clean your windscreen at the filling station is very convenient. You have to wait for the fuel anyway, and they have all the equipment, soap and water available. The ETL process is similarly very convenient - it touches all the data and most of the industry-strength ETL tools have excellent facilities for data cleansing.
However, the window-cleaning attendant does not always address the root of the problem - he cannot clean your kids` shoe marks and fatty fingerprints on the inside. Similarly, the ETL process does not clean the data in the source database - it only cleans what goes into the data warehouse or data mart. Thus, the sticky smudges still remain, even though they may appear better since other superficial blotches have been removed.
In addition, every process has to perform these cleansing tasks whenever the source data is used or re-used. With many migrations going on, the dirty data is migrated from system to system, it surfaces in different ways, and constantly any new or changed processes have to cater for it.
Chocolates
With poetic licence from our friend Forest, every source system is like a box of chocolates - you never know what you gonna get... This is the one serious problem around data cleansing during ETL - the developers get bowled over by so many exceptions and problem cases, that what seemed like a trivial ETL transformation and load process, ends up in a data cleansing up-hill marathon. What makes the problem worse, is that many of the big data quality bugs only hit the windscreen during testing with "live" data (which is more often than not left for way too late).
Add to that the impact of "late changes", where developers have to find and augment ETL code they have developed (and did not document) more than a month ago, to build in additional data quality checks and scrubs. Even in old waterfall-style projects, it was the late additions that killed the project. This domino effect has so much more of an impact in short-cycle quick-delivery style business intelligence projects.
Enter data profiling
Imagine you had a window cleanliness profiler in your car - it would tell you before you hit the filling station whether the windscreen was dirty inside, outside or both; and what cleansing material would work best on the smudges at hand.
Well, in the business intelligence and data migration world there are such magic tools. The data profiling tools now available on the market provide such powerful facilities to analyse the data in the source systems before it is even touched by the ETL processes. With the results of data profiling, the ETL processes can be developed to cater exactly for the data quality problems at hand.
And guess what? No more surprises! With proper data profiling done during the data analysis phase, one can know up-front what the ETL processes have to cater for. The profiling tools indicate whether there are structural problems, data domain problems, relationship inconsistencies, missing data, and the list goes on. Using this input, project managers can now much more accurately estimate the ETL and testing effort that will be required. With the quality indicators and reports available up front, one can have a clear vision of what the ETL development process has to cater for.
Summary
There is no more need to try and drive with dirty windscreens. The attendants now have good tools to wash the windscreens. And you will also know beforehand what effort will be required - whether it will be a R1, R2 or R5 wash-job that is required. Using the data profiling tools correctly, you can have a very accurate estimate of the effort required way before the process is even started.
Editorial contacts


