
When content and data collide

The massive amount of content gathering in the business means the data warehouse is taking strain.

By Dr Barry Devlin, Founder and principal, 9sight Consulting
Johannesburg, 12 Oct 2010

“The worlds of data and content are on a collision course! With ever-growing hordes of content gathering in the business and on the Internet, the old civilisation of the data warehouse is under siege. But, never fear! A solution is emerging - the outcome will be integration, not annihilation.”

For a businessperson, the paragraph above from a recent white paper I wrote makes little sense. Data and content are different worlds? Not in the eyes of a decision-maker, or at least one who hasn't yet been brainwashed by IT. That data and content are considered separate stems from two factors: information modelling and computer storage/processing considerations.

The former consideration still has validity today. The latter deserves a second look, 50 years since the widespread adoption of “data processing” - and that old name for IT is a clue in itself.

Sluggish

Way back in the 1960s, mainframe computers were orders of magnitude slower than even today's PCs, and users were lucky to have 128K of memory to hold both their program and the data to be processed. There was an enormous incentive, therefore, to trim data to a minimum: remove all unnecessary verbiage from the information, structure what remained in the most efficient format for processing, convert categories to single-byte codes, and so on. Essentially, convert content - information as used and understood by people - to data suitable for machines.

So, given the vast advances in computers since then, why does this practice persist?


Partially, it's a matter of habit. There is also the discipline of information (or data) modelling, the fundamental purpose of which is to ensure the logical completeness and consistency of the information that is used to run and manage business. Modelling is the process that converts content into data. And today's approach is that modelling is done during system design, so that only the data need be stored and processed; the content is then discarded.

For example, an order entry process design starts from the story of how people buy the company's goods or services “before computers” (a phone call, letter or conversation) and distils that down to the mandatory fields and records (data) that need to be captured. An order entry application is then built where a trained operator or Web-savvy user fills in the required fields, and voilà, the company is ready to start taking orders.
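That distillation step can be sketched in a few lines of code. The field names and product codes below are purely illustrative assumptions, not any particular order system: the point is simply that everything outside the structured fields (the original phone call or letter) is discarded.

```python
from dataclasses import dataclass

# Hypothetical order record: the mandatory fields a modeller keeps.
# The rest of the customer's "story" (the content) never gets stored.
@dataclass
class Order:
    order_id: int       # surrogate key, meaningless to the customer
    customer_id: int    # identity reduced to a code
    product_code: str   # "the blue widgets from last time" -> "WGT-B"
    quantity: int

def capture_order(order_id, customer_id, product_code, quantity):
    """The order-entry 'application': keeps only the structured fields."""
    return Order(order_id, customer_id, product_code, quantity)

order = capture_order(1001, 42, "WGT-B", 250)
print(order.product_code, order.quantity)  # prints: WGT-B 250
```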

Content explosion

This is all very well as long as most information enters the business via such applications. However, the volumes of content are exploding. The IDC “Expanding Digital Universe” study over the past few years shows the scale of this growth. While data in enterprises is growing at about 20% CAGR, content is growing at approximately 60%. The relative proportion of content to data in 2010 is estimated at an amazing 95% to 5%, up from 85% to 15% only five years ago. While these figures may be discounted for duplication of information and the high proportion of video in the content figures, the trend is very clear: the digital world has already moved from data to content. Full stop.
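As a quick sanity check on those figures, compounding the quoted growth rates from the 2005 split does indeed land near the 95:5 mark. This is a back-of-envelope calculation, assuming the rates hold steadily for five years:

```python
# Start from the 2005 split (content:data = 85:15) and compound the
# quoted growth rates (60% CAGR for content, 20% for data) for 5 years.
content, data = 85.0, 15.0
for _ in range(5):
    content *= 1.60
    data *= 1.20

content_share = content / (content + data)
print(f"content share after 5 years: {content_share:.0%}")  # prints: 96%
```

A 96% share, close enough to the estimated 95:5 ratio to confirm the cited growth rates and proportions are mutually consistent.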

In terms of the order entry application development and information capture scenario described above, a very small proportion of information now comes through this channel. While such information is certainly key to running a business, and may still be best gathered through traditional form-filling, there is a huge and rapidly increasing volume of important information that can only be found in the world of content. Social networking/Web 2.0 initiatives allow people to generate content as they see fit. Such content increasingly contains facts and opinions that are vital to business decision-making - and in volumes that dwarf current data stores.

So, the question arises: does it make sense to copy such a huge volume of content into relational databases in order to combine it with the much smaller volumes of data and use it as part of business operations or decision-making? Furthermore, once the content has been copied into the database, text analytic functions must be run there to extract meaningful data from it. Given that such processing has likely already been done in the content systems to enable search, this overall approach to handling content makes very little sense.
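The duplication concern can be made concrete with a toy example of the kind of text-analytic pass a content system already performs at index time. The product pattern and sentiment lexicon below are illustrative assumptions only; the point is that once this extraction has been done in the content system, repeating it inside a relational database is wasted work.

```python
import re

# Toy text analytics: pull structured facts out of free-form content.
PRODUCT = re.compile(r"\b(WGT-[A-Z])\b")          # hypothetical product codes
SENTIMENT = {"love": "positive", "great": "positive",
             "broken": "negative", "disappointed": "negative"}

def extract_facts(text):
    """Return (product mentions, sentiment labels) found in content."""
    products = PRODUCT.findall(text)
    sentiments = [label for word, label in SENTIMENT.items()
                  if word in text.lower()]
    return products, sentiments

review = "Love the WGT-B, but the WGT-C arrived broken."
print(extract_facts(review))  # prints: (['WGT-B', 'WGT-C'], ['positive', 'negative'])
```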

A more reasonable approach is a “unified information store”, which encompasses both relational databases and content management systems. Such a unified information store is essentially a federation of data stores, with an integrated interface that allows users to access data and content in a combined manner. The heart of this store is a core set of indexes and metadata, originating from up-front enterprise modelling on one hand and text analytics of information at load-time on the other. Together, these ensure both data quality and agility. For business, the outcome is analytics that combine the precision of data querying with the relevance of content search, independent of the information source and structure.
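A minimal sketch of the idea, assuming two toy stores linked by shared metadata (here, a customer identifier). The names and data are illustrative assumptions, not a vendor API; the federation point is simply that one interface answers with both a precise data query and a relevance-style content search.

```python
# The "data side": structured, precisely queryable.
orders = {
    42: {"product_code": "WGT-B", "quantity": 250},
}

# The "content side": free text, searched rather than queried.
reviews = [
    {"customer_id": 42, "text": "The widgets arrived late but work well."},
    {"customer_id": 7,  "text": "Never received my order."},
]

def unified_lookup(customer_id, keyword):
    """One interface over both stores, joined on shared metadata."""
    order = orders.get(customer_id)                      # data query
    mentions = [r["text"] for r in reviews
                if r["customer_id"] == customer_id
                and keyword in r["text"].lower()]        # content search
    return {"order": order, "mentions": mentions}

result = unified_lookup(42, "late")
print(result["order"]["quantity"], len(result["mentions"]))  # prints: 250 1
```

In a real unified information store the join key would come from up-front modelling and load-time text analytics rather than a hand-coded field, but the shape of the answer, data and content combined, is the same.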

Software vendors from both viewpoints - data and content - are already delivering products that blend the two worlds. But, in my view, the key to progress in this area is to avoid unnecessary copying of content and duplication of text analytic function to extract meaning and structure from content. Businesses that begin to implement a unified information store stand to gain early adopter advantage in this rapidly growing market.
