Unstructured data offers store of untapped BI value

Traditional BI vendors agree that clean data is essential to analysis and reporting. But organisations are losing a huge amount of information if they rigidly apply this approach to data gathering and analysis.
By Garth Wittles, District manager for Verity South Africa
Johannesburg, 19 Oct 2004

Typically 20% of an organisation's data is structured, which means that it can be used in business intelligence (BI) data analysis. The remaining 80% is usually cast aside and is not factored into the analysis and reporting that business executives find so crucial to making decisions.

The volumes associated with those percentages are growing rapidly every year as data storage costs continue to decline and regulatory issues drive data retention. With the ubiquitous deployment of enterprise resource planning, customer relationship management, supply chain management and other information-gathering systems, the amount of data companies can collect is also rapidly increasing.

The collection of this data does not always follow strict processes - or the processes are not always strictly applied - and the result is unstructured data. For instance, a company may collect customer information through its call centre, a Web site and paper forms, and processes are rarely applied uniformly across these different channels. Much of an organisation's information is also stored in text files and Word documents.

Even if all the information is gathered in a database, popular opinion says it remains unsuitable for analysis because of incomplete fields in the rows of the relational database. With the right tools, however, dirty XML (eXtensible Markup Language), incomplete relational data and incomplete metadata still have value.

The BI world has evolved on a base of structured data with good reason: BI is best defined as giving business executives the facts on which to base their decisions. In the BI technology that delivers them to modern executives, these facts are inherently relational: they are expressed in tables and reduced to the smallest piece of data possible. For instance, customer data will express a first name; surname; date of birth by day, month and year; identity number; vehicle registration number; and street address, by number, street, suburb and city, along with a postal code. If the postal code is missing, is there no value left in the record?
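The question about the missing postal code can be made concrete with a minimal sketch. The customer records and the `completeness` helper below are hypothetical sample data and names, not taken from any real system; the sketch only illustrates that a report needing one field can still use rows that are incomplete in another.

```python
from collections import Counter

# Hypothetical sample records; None marks a field that was never captured.
customers = [
    {"surname": "Nkosi", "city": "Johannesburg", "postal_code": "2196"},
    {"surname": "Smith", "city": "Cape Town", "postal_code": None},
    {"surname": "Naidoo", "city": "Durban", "postal_code": "4001"},
    {"surname": "Botha", "city": "Johannesburg", "postal_code": None},
]


def completeness(records, field):
    """Return the share of records in which `field` is filled in."""
    filled = sum(1 for record in records if record.get(field))
    return filled / len(records)


# Keeping only fully populated rows discards every record with a gap.
complete_only = [r for r in customers if all(r.values())]

# A per-city count needs only the city field, so every row contributes.
customers_by_city = Counter(r["city"] for r in customers if r.get("city"))

print(f"postal code completeness: {completeness(customers, 'postal_code'):.0%}")
print(f"rows kept by strict filtering: {len(complete_only)} of {len(customers)}")
print(f"customers per city: {dict(customers_by_city)}")
```

Reporting completeness alongside the result keeps the reader aware of the gaps instead of hiding them by dropping rows.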

Many businesses have attempted to draw their unstructured data into this format, because they realise there is value to be had. What they have been unable to find is a method of getting it there. Typically, they have employed statistical concept extraction and topic-based categorisation. These techniques have, however, proven inadequate from a traditional BI point of view.

Text files and Word documents are unstructured. To database operators they are nothing more than a collection of concepts or topics, which makes it difficult to extract meaningful data that can be used to develop a relational structure. But that view is changing.

One method of attaining structure from text is entity extraction. For instance, the entity extractor analyses unstructured data containing street addresses and, assisted by a geographic dictionary, pulls out all the addresses and standardises their format using grammars that are layered above the dictionary.
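The dictionary-plus-grammar idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not the method of any particular product: the `GAZETTEER` entries, the `STREET_TYPES` table, the `ADDRESS_RE` pattern and the `extract_addresses` function are all hypothetical names introduced here, and a regular expression stands in for a real grammar.

```python
import re

# Hypothetical geographic dictionary (gazetteer): suburb -> city.
GAZETTEER = {
    "rosebank": "Johannesburg",
    "sandton": "Johannesburg",
}

# Hypothetical table mapping street-type variants to one standard form.
STREET_TYPES = {
    "st": "Street", "street": "Street",
    "rd": "Road", "road": "Road",
    "ave": "Avenue", "avenue": "Avenue",
}

# A small "grammar" layered above the dictionary:
# number, street name, street type, then a suburb known to the gazetteer.
ADDRESS_RE = re.compile(
    r"\b(?P<number>\d+)\s+"
    r"(?P<name>[A-Za-z]+(?:\s+[A-Za-z]+)*?)\s+"
    r"(?P<type>" + "|".join(sorted(STREET_TYPES, key=len, reverse=True)) + r")"
    r"\.?,?\s+"
    r"(?P<suburb>" + "|".join(GAZETTEER) + r")\b",
    re.IGNORECASE,
)


def extract_addresses(text):
    """Find addresses in free text and return them in one standard format."""
    results = []
    for match in ADDRESS_RE.finditer(text):
        suburb = match.group("suburb").lower()
        street_type = STREET_TYPES[match.group("type").lower()]
        results.append({
            "number": match.group("number"),
            "street": f"{match.group('name').title()} {street_type}",
            "suburb": suburb.title(),
            "city": GAZETTEER[suburb],  # supplied by the dictionary, not the text
        })
    return results


SAMPLE = (
    "Please deliver to 12 oxford rd, Rosebank before Friday. "
    "Our other office is at 5 Rivonia Road Sandton."
)

for address in extract_addresses(SAMPLE):
    print(address)
```

Note that the city never appears in the sample text: it is filled in from the dictionary once the suburb is recognised, which is how free text becomes a row with the fields a relational table expects.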

Besides the data itself, metadata is key to good structure. Once the unstructured data is processed, key fields can be extracted from the data, and the missing or incorrect metadata fields updated.
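As a rough sketch of that metadata step, the example below pulls an author and a date out of a document's text and uses them to repair a metadata record. The sample text reuses this article's own byline; the field names, the starting metadata values and the functions `extract_fields` and `repair_metadata` are assumptions made for illustration. Date parsing relies on English month abbreviations, which is Python's default behaviour unless the locale has been changed.

```python
import re
from datetime import datetime

# Document text; the byline and dateline are taken from this article.
DOC = """Unstructured data offers store of untapped BI value
By Garth Wittles, District manager for Verity South Africa
Johannesburg, 19 Oct 2004
"""


def extract_fields(text):
    """Extract key fields (author, publication date) from raw text."""
    fields = {}
    byline = re.search(r"^By ([^,\n]+)", text, re.MULTILINE)
    if byline:
        fields["author"] = byline.group(1).strip()
    date = re.search(r"\b(\d{1,2} [A-Z][a-z]{2} \d{4})\b", text)
    if date:
        parsed = datetime.strptime(date.group(1), "%d %b %Y")
        fields["published"] = parsed.date().isoformat()
    return fields


def repair_metadata(metadata, extracted):
    """Fill empty metadata fields; report disagreements instead of overwriting."""
    repaired = dict(metadata)
    conflicts = {}
    for key, value in extracted.items():
        if not repaired.get(key):
            repaired[key] = value  # missing field: fill it in
        elif repaired[key] != value:
            conflicts[key] = (repaired[key], value)  # possibly incorrect field
    return repaired, conflicts


# Hypothetical stored metadata: the author is missing and the date is wrong.
stored = {"title": "Unstructured data offers store of untapped BI value",
          "author": "", "published": "2004-10-18"}

repaired, conflicts = repair_metadata(stored, extract_fields(DOC))
print(repaired)
print(conflicts)
```

The sketch fills gaps automatically but only flags conflicting values, on the assumption that deciding which value is incorrect is a policy choice best left to the organisation.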

One company that used this entity extraction approach is Superpages.com, a US-based online yellow pages directory. It deployed a multifaceted search function servicing nine million unique visitors generating 150 million page views each month across structured and unstructured data, which is stored in both English and Spanish. The result of searching both types of data - structured and unstructured - is that traffic to the site grew 75% in 2002 over the previous year.

The advent of these new entity extraction software tools allows companies to make use of a large portion of their data that until now only consumed vast amounts of disk space.

Companies need only keep this in context when dealing with the information: the data may be incomplete, but it is not without value, as Superpages.com has found.

The ability to analyse unstructured data will give organisations the information they need to deal with pressure from regulators and shareholders, and with a growing need for transparency and accountability.

Verity co-sponsors ITWeb's enterprise industry portal. It carries articles on the hardware, software and networking technologies and infrastructure that drive and deliver e-commerce, enterprise resource planning, data warehousing, storage, outsourcing, human resources, middleware, Web strategies, business reporting tools and more.
