Data duplication stifles quality

By Alex Kayle, Senior portals journalist
Johannesburg, 02 Dec 2010

Data quality in an organisation will never be 100% accurate due to data duplication, missing content and huge volumes of unstructured data.

This emerged during this week's ITWeb Data Warehousing 2010 event in Midrand, where Mervyn Mooi, director of Knowledge Integration Dynamics, gave an overview of how companies can enforce data quality.

“At the operational level, data quality should be at 100%, but when reported at a strategic level, it may deviate slightly. Within this deviation, a business can allow for a deeper analysis into the data quality problems.

“Some allow for 5% deviation; however, it does vary from organisation to organisation and it depends on business priorities,” he explained.
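
As a rough illustration of such a deviation tolerance, the sketch below scores a handful of records for completeness and flags a breach of an agreed threshold. The field names, sample records and 5% figure are hypothetical.

```python
# Minimal sketch: measure data quality as the share of complete rows
# and flag it when it falls outside an agreed deviation tolerance.
# Column names, sample data and the 5% tolerance are illustrative assumptions.

records = [
    {"customer_id": "C001", "email": "a@example.com"},
    {"customer_id": "C002", "email": None},            # missing content
    {"customer_id": "C003", "email": "c@example.com"},
]

TOLERANCE = 0.05  # e.g. a business that allows 5% deviation

complete = sum(1 for r in records if all(r.values()))
quality = complete / len(records)

if (1 - quality) > TOLERANCE:
    print(f"Quality {quality:.0%} breaches the {TOLERANCE:.0%} deviation tolerance")
else:
    print(f"Quality {quality:.0%} is within tolerance")
```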

Governance needed

According to Mooi, organisations striving for improved data quality must create a data warehousing framework that needs to be controlled, governed and accessible.

“Typically, missing content will result in incomplete results,” he said. “Differing content for the same entity will leave you with many versions of the truth.

“A recent survey revealed that all organisations in SA have data quality issues in various aspects and spend at least 10% of their budget on fixing the data, [and] to make things worse, we duplicate data everywhere. Main entities are defined more than once, with different formats and storage conventions.”

“It's a spaghetti junction from source data into the mart. There is a myriad of data processes that are repeated and overlap.”

Data quality is a discipline

Mooi added that data quality starts outside of the data warehouse and has to be an enterprise-wide task. “A business needs commonly defined data models, and architectures need to be lean. If you address data governance, services, architecture and content, you should have few problems with your data warehouse.”

He said de-duplication techniques are critical to reduce the repetition of data. “A business needs to use an agreed standard architecture based on cost-effective technologies such as de-duplication. Collaboration and data quality are everyone's responsibility.”
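
A minimal de-duplication sketch along those lines might normalise records for the same entity before comparing them and keep a single survivor. The customer fields and normalisation rule here are illustrative assumptions, not a specific product or method Mooi named.

```python
# Minimal de-duplication sketch: normalise records for the same entity
# (here a customer, keyed on a cleaned-up email address) and keep one version.
# Field names and the normalisation rule are illustrative assumptions.

customers = [
    {"name": "J. Smith",   "email": "J.Smith@Example.com "},
    {"name": "John Smith", "email": "j.smith@example.com"},
    {"name": "A. Naidoo",  "email": "a.naidoo@example.com"},
]

def normalise(email: str) -> str:
    """Apply an agreed standard format before comparing records."""
    return email.strip().lower()

deduplicated = {}
for record in customers:
    key = normalise(record["email"])
    # Keep the first record seen for each key; a real rule set would
    # merge fields or pick the most complete survivor.
    deduplicated.setdefault(key, record)

print(f"{len(customers)} records reduced to {len(deduplicated)} unique entities")
```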

Integration is not possible if the business does not understand its metadata, he said, because metadata gives data its structure and content. He added that metadata needs to be collated and pooled into a metadata repository.
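
One way to picture such a repository is a single list of entries describing where each data element lives, its structure and its agreed meaning. The entry fields and examples below are hypothetical.

```python
# Minimal sketch of pooling metadata into one repository: each entry
# records where a data element lives, its structure and its meaning.
# The fields and example entries are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class MetadataEntry:
    element: str      # business name of the data element
    source: str       # system or table it comes from
    data_type: str    # structure
    definition: str   # agreed meaning (content)

repository: list[MetadataEntry] = [
    MetadataEntry("customer_id", "crm.customers", "string", "Unique customer key"),
    MetadataEntry("customer_id", "billing.accounts", "integer", "Account holder reference"),
]

# The same entity defined twice with different formats - the kind of
# duplication that blocks integration until it is reconciled.
for entry in repository:
    print(entry)
```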

“Data quality improvement is not a once-off exercise; it's a continuous, iterative process. It must evolve from an error correction situation to an error prevention situation.”
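
Moving from correction to prevention can be as simple, in sketch form, as validating records at load time rather than repairing them afterwards. The rules and field names below are illustrative assumptions.

```python
# Minimal sketch of error prevention: validate records as they enter the
# warehouse instead of fixing them later.
# The validation rules and field names are illustrative assumptions.

def validate(record: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the record may load."""
    errors = []
    if not record.get("customer_id"):
        errors.append("missing customer_id")
    if record.get("amount", 0) < 0:
        errors.append("negative amount")
    return errors

incoming = [
    {"customer_id": "C001", "amount": 150.0},
    {"customer_id": "",     "amount": -20.0},
]

for record in incoming:
    problems = validate(record)
    if problems:
        print(f"Rejected at load time: {problems}")  # prevented, not corrected later
    else:
        print("Loaded:", record)
```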
