
Track data lineage to build confidence

For compliance purposes, organisations need to establish a complete "data lineage" that describes any piece of data, starting with its creation and ending with archiving or reporting.
By Charl Barnard, GM of business intelligence at Knowledge Integration Dynamics
Johannesburg, 07 Dec 2005

With the advent of Sarbanes-Oxley in 2002, and the introduction in recent years of other compliance requirements, organisations have been forced to tighten up their reporting processes and to provide proof that their financial reports are complete, accurate and have not been tampered with.

A recent design tip from the Kimball Group, the foremost authority on dimensional data warehousing, highlighted a shift in the business intelligence (BI) arena that is a direct result of new regulatory compliance requirements for financial disclosure.

This shift focuses on the ability to prove the lineage of business reporting: from the initial capture of data, through its transformation in intermediate structures, right up to its final appearance as part of a measure or key performance indicator (KPI).

Even without the legal requirements for saving data, every data warehouse needs various copies of old data, either for comparisons with new data to generate change capture records, or for reprocessing.

The Kimball Group makes the point that data should be staged (written to disk) at each point where a major transformation has occurred. In its basic data flow thread, these staging points occur after four steps: extract, clean, conform and deliver. At some point, staged data needs to be archived (kept indefinitely on permanent media), and a decision must therefore be made as to which data should be archived.
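To make the idea of staging concrete, the sketch below shows one way this flow could look in practice. It is an illustration only, assuming a simple orders file and hypothetical extract(), clean(), conform() and deliver() functions; it is not taken from the Kimball Group or any particular tool.

```python
# Minimal sketch: write the intermediate result to disk after each major
# transformation step (extract, clean, conform, deliver), so every staging
# point can later be archived. All names and data here are illustrative.
import csv
from pathlib import Path

STAGING_DIR = Path("staging")  # assumed local staging area

def stage(rows, step):
    """Persist the intermediate result so this staging point can be archived."""
    STAGING_DIR.mkdir(exist_ok=True)
    path = STAGING_DIR / f"orders_{step}.csv"
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
    return path

def extract():
    # Stand-in for pulling records from a source system.
    return [{"order_id": "1", "amount": " 100.0 ", "country": "za"}]

def clean(rows):
    # Remove stray whitespace and other obvious defects.
    return [{**r, "amount": r["amount"].strip()} for r in rows]

def conform(rows):
    # Align codes with the shared (conformed) dimension standard.
    return [{**r, "country": r["country"].upper()} for r in rows]

def deliver(rows):
    # In a real warehouse this would load the presentation tables.
    return rows

rows = extract()
stage(rows, "extract")
for step, fn in [("clean", clean), ("conform", conform), ("deliver", deliver)]:
    rows = fn(rows)
    stage(rows, step)  # a staging point after each major transformation
```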

The wisest approach is to archive all staged data unless a conscious decision is made that specific data sets will never be recovered. Furthermore, each staged/archived data set should have accompanying metadata describing the origins and processing steps that produced the data. Again, while the tracking of this lineage is an overt requirement of certain compliance regulations, it should in fact be part of every archiving situation.
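As a rough illustration of such accompanying metadata, the sketch below writes a small "sidecar" file next to a staged data set, recording its origin, the processing steps that produced it and a content hash. The JSON layout and function name are assumptions for illustration, not part of any compliance standard or product.

```python
# Minimal sketch, assuming staged files like the ones described above:
# record where the data came from and which steps produced it.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def write_lineage_sidecar(staged_file, source, steps):
    digest = hashlib.sha256(staged_file.read_bytes()).hexdigest()
    metadata = {
        "file": staged_file.name,
        "source_system": source,       # origin of the data
        "processing_steps": steps,     # transformations applied so far
        "sha256": digest,              # supports proving the data is unchanged
        "staged_at": datetime.now(timezone.utc).isoformat(),
    }
    sidecar = staged_file.with_name(staged_file.name + ".lineage.json")
    sidecar.write_text(json.dumps(metadata, indent=2))
    return sidecar

# Example: describe the conformed staging file produced earlier.
# write_lineage_sidecar(Path("staging/orders_conform.csv"),
#                       source="ERP order system",
#                       steps=["extract", "clean", "conform"])
```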

Kimball suggests that a conservative approach to meeting such compliance requirements should include the ability to:
* Prove (backward) lineage of each final measure and KPI appearing in any report;
* Prove (forward) impact of any primary or intermediate data element on a final report;
* Prove input data has not been changed;
* Prove final measures and KPIs are derived from original data under documented transformations;
* Document all transformations, present and past.
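The first two points in the list above, backward lineage and forward impact, can be answered from a recorded dependency graph. The sketch below assumes a simple mapping of each data element to the elements it was derived from; the element names and structure are invented for illustration.

```python
# Assumed lineage record: each element maps to the elements it was derived from.
derived_from = {
    "report.revenue_kpi": ["warehouse.fact_sales.amount"],
    "warehouse.fact_sales.amount": ["staging.orders_conform.amount"],
    "staging.orders_conform.amount": ["source.erp.orders.amount"],
}

def backward_lineage(element, graph=derived_from):
    """Everything a final measure or KPI ultimately depends on."""
    found = []
    for parent in graph.get(element, []):
        found.append(parent)
        found.extend(backward_lineage(parent, graph))
    return found

def forward_impact(element, graph=derived_from):
    """Every downstream element a primary or intermediate field feeds into."""
    found = []
    for child, parents in graph.items():
        if element in parents:
            found.append(child)
            found.extend(forward_impact(child, graph))
    return found

print(backward_lineage("report.revenue_kpi"))
print(forward_impact("source.erp.orders.amount"))
```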

As a result of these "prove that your numbers are true" requirements, BI companies have to recognise an extended focus on the origination of data: examining where data comes from before it arrives in the data warehouse or BI solution. An important part of building confidence in a company's numbers is to enable transparency and visibility. This includes knowing exactly how key metrics are defined, which data source is used and when the extraction, transformation and loading occurred. All of this depends on an accurate record of data lineage in the BI environment.

The time has come for BI to encompass the total data lifecycle, including data integration, data profiling, data quality and data cleansing.

For organisations seeking to strengthen and standardise their data access, quality and security policies, data stewardship initiatives are increasingly viewed as critical to the business.

BI specialists need to take advantage of effective user interfaces and a broad range of available data integration services to provide organisations with seamless data profiling and cleansing capabilities. Data profiling automates the discovery of source data patterns and formats, offering a complete understanding of data, including content, quality and structure. Combined with data cleansing and transformation, data profiling shortens the time to deployment of integration projects by providing insight into the condition of the data before extraction.
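The kind of content, quality and structure statistics that profiling produces can be illustrated with a few lines of code. The sketch below assumes tabular source data held as a list of dictionaries and reports null counts, distinct counts and a crude format inference; it stands in for the far richer profiling that commercial tools provide.

```python
# Minimal data profiling sketch over assumed tabular source data.
import re

def profile(rows):
    stats = {}
    for column in rows[0].keys():
        values = [r.get(column) for r in rows]
        non_null = [v for v in values if v not in (None, "")]
        stats[column] = {
            "null_count": len(values) - len(non_null),
            "distinct_count": len(set(non_null)),
            "looks_numeric": bool(non_null) and all(
                re.fullmatch(r"-?\d+(\.\d+)?", str(v)) for v in non_null
            ),
            "sample": non_null[:3],
        }
    return stats

rows = [
    {"customer_id": "1001", "postcode": "2196", "email": ""},
    {"customer_id": "1002", "postcode": "",     "email": "a@b.co.za"},
]
for column, s in profile(rows).items():
    print(column, s)
```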

Tools need a comprehensive set of data quality features to maximise the integrity and value of important information. Quality functions cleanse and standardise organisational data by leveraging a rich set of capabilities for parsing, address standardisation and matching, ensuring that data is accurate when it reaches its final destination.
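As a rough sketch of what standardisation and matching involve, the example below expands common address abbreviations and fuzzy-matches two addresses. The abbreviation table, threshold and similarity measure are illustrative assumptions, not the rules of any particular data quality tool.

```python
# Illustrative address standardisation and fuzzy matching.
from difflib import SequenceMatcher

ABBREVIATIONS = {"rd": "road", "st": "street", "ave": "avenue"}

def standardise_address(raw):
    """Lower-case, trim punctuation and expand common abbreviations."""
    words = raw.lower().replace(",", " ").split()
    return " ".join(ABBREVIATIONS.get(w.rstrip("."), w.rstrip(".")) for w in words)

def is_match(addr_a, addr_b, threshold=0.9):
    """Treat two addresses as the same record if they are similar enough."""
    a, b = standardise_address(addr_a), standardise_address(addr_b)
    return SequenceMatcher(None, a, b).ratio() >= threshold

print(standardise_address("12 Main Rd., Johannesburg"))
print(is_match("12 Main Rd, Johannesburg", "12 Main Road Johannesburg"))
```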
