
The BI sandwich

Data profiling can be a valuable tool for understanding metadata - by applying BI on top of BI.

By Cor Winckler, Technical director at PBT Group.
Johannesburg, 19 Apr 2010

Every so often, I sit back and get a little philosophical about the business intelligence (BI) industry and the environment we work in.

From varying opinions and trends, to challenges and vast opportunities, there is no denying that the BI industry is a dynamic one - one that requires constant development and thought leadership. This philosophical stance is usually brought on by moments when the realisation strikes that companies can - and should - actually be implementing BI on top of BI. While that may sound somewhat confusing, it isn't, and the results are justification enough to do so.

I have already indicated in a previous Industry Insight how run-time statistics of jobs can be used to create a data mart that allows for analysis of the ETL environment. In addition to this, another such activity is the use of standard data profiling techniques: instead of applying them to determine the health of source system data, or the distribution of cardinalities and source system data values, many of the same techniques can be used to profile technical metadata.

Facts and figures

Data profiling refers to the process of analysing data content and structures. It normally plays a role in the data quality field, where it is used to check for conformance. When profiling is done in the ETL source-to-target mapping (STM) space, one normally does not run extensive and continuous conformance checks; instead, profiling is performed once off to understand the data, iterating until the desired result is achieved. The output of the analysis is used to check whether the data is fit for purpose - for example, whether a collection of data elements or artefacts can be combined to answer complex business rules and calculations.

Standard techniques for data profiling include, but are of course not limited to, the following (a small sketch after the list illustrates them):

* Testing for uniqueness
* Validating foreign key relationships
* Cardinality checks
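
To make this concrete, here is a minimal sketch in Python (using pandas) of the three checks above. The table and column names are hypothetical examples, not drawn from any particular source system.

```python
# A minimal data-profiling sketch using pandas. The column names
# (customer_id, country_code, segment) and the lookup table are
# hypothetical examples for illustration only.
import pandas as pd

# Sample "source system" extract and a reference (lookup) table.
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "country_code": ["ZA", "ZA", "UK", "XX"],
    "segment": ["retail", "retail", None, "corporate"],
})
countries = pd.DataFrame({"country_code": ["ZA", "UK", "US"]})

# Testing for uniqueness: could customer_id serve as a primary key?
duplicates = customers["customer_id"].duplicated().sum()
print(f"Duplicate customer_id values: {duplicates}")

# Validating a foreign key relationship: every country_code in the
# extract should exist in the reference table.
orphans = customers[~customers["country_code"].isin(countries["country_code"])]
print(f"Orphaned country codes:\n{orphans}")

# Cardinality checks: distinct values and nulls per column.
for column in customers.columns:
    distinct = customers[column].nunique(dropna=True)
    nulls = customers[column].isna().sum()
    print(f"{column}: {distinct} distinct value(s), {nulls} null(s)")
```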

However, when it comes to metadata, a common problem is that it is collected and stored by different parts of the ETL and design process. For example, table and column layouts come from source system database catalogues; ETL and transformation metadata comes from the ETL tool; and reporting metadata comes from the front-end tool. Certainly, once these are brought together, all of the above techniques can be valuable in answering questions such as the following (a sketch after the list shows a few of them):

* How many times is a given table used in ETL?
* Are columns consistently used throughout the environment?
* Are certain columns never used in transformations or in reports?
* Are there reports that refer to similar content, but from multiple sources?
* Are there metadata columns (like descriptions) that are not populated, or consistently null?
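
As an illustration, here is a small sketch, again in Python with pandas, that applies the same profiling ideas to technical metadata. The metadata tables and their layouts below are hypothetical stand-ins for whatever the database catalogue, ETL tool and reporting tool actually expose.

```python
# A sketch of profiling technical metadata gathered from different tools.
# The three metadata extracts below are hypothetical examples.
import pandas as pd

catalogue = pd.DataFrame({
    "table_name":  ["sales", "sales", "sales", "customer"],
    "column_name": ["amount", "region", "legacy_flag", "customer_id"],
    "description": ["Sale amount", None, None, "Surrogate key"],
})
etl_usage = pd.DataFrame({
    "table_name":  ["sales", "sales", "customer"],
    "column_name": ["amount", "region", "customer_id"],
})
report_usage = pd.DataFrame({
    "table_name":  ["sales"],
    "column_name": ["amount"],
})

# How many times is a given table used in ETL?
print(etl_usage["table_name"].value_counts())

# Which columns are never used in transformations or in reports?
used = pd.concat([etl_usage, report_usage]).drop_duplicates()
never_used = catalogue.merge(
    used, on=["table_name", "column_name"], how="left", indicator=True
)
print(never_used.loc[never_used["_merge"] == "left_only",
                     ["table_name", "column_name"]])

# Which metadata columns (like descriptions) are not populated?
print(catalogue[catalogue["description"].isna()])
```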

The list is potentially endless, and these techniques are yet another example of how an established best practice principle or discipline from one area of BI can be applied to an unexpected, quite different area of BI. This puts another tool in companies' toolboxes to better serve the growing need to do BI better and faster.

So, on reflection, it becomes apparent that almost nothing in the world of BI remains constant, except the inevitability of change and, most importantly, adaptation. Going forward, I believe that by taking the methodologies and tools that have already been formulated and applying them to new areas of the industry, we will create new ways of working and new ways of analysing information, and ultimately provide businesses with new ways to make decisions.
