What lies between the lines: The quiet presence of unstructured data

By Martin Rennhackkamp for PBT Group

Johannesburg, 04 Nov 2008

Value as a selling proposition is as much a technique as an understanding of what a customer wants. The latter implies the presence of a deduction, a logical determination of something that has a structure attached to it. Whether that is the creation of an investment policy or a toothbrush, the basis of it remains the same: it is the summation of processed information that leads to that creation. A lot of what we understand about a customer is sourced from traditional operational systems. We look at orders that are placed, colours that are chosen, sizes that are taken and at times returned for others, and we draw conclusions about what is "wanted" or not. That is the length and breadth of our captured engagement with the customer, and it forms the basis of our analysis of their behaviour and our understanding of the relationship.

The relationship is somewhat sparse on detail if you consider the wealth of things we don't know about someone when they choose to buy something or decide against it. We cannot speak to them as they walk down an aisle, or watch quietly as they pick something up, put it down and pick up the item next to it. We put forward surveys and multiple-choice questions about their overall experience, but conventional operational means will never capture that initial emotional reaction, good or bad, to something that has happened. Which brings me ultimately to my interest at hand - unstructured data.

Unstructured data groups together all the things that are not captured in a tabular or comma-delimited form. You usually find it as a document; a comment somewhere on a forum; an issue ticket in a service provider's CRM system logged when a service is cancelled, along with the rant that ensues when things go wrong; an e-mail that you sent to your colleague to ask if the Web site is back up; the video file that you copied from the server to your computer to watch a webinar on the latest BI trends in 2008; or a gadget blog that you read every day because you love mobile technology.

It is very interesting to note who the big noise creators are in this market. Surprisingly, it is currently not the big four (Oracle, Microsoft, IBM and SAP, which has bought Business Objects and others) making their presence felt. Business Objects did, to a degree, anticipate its involvement by buying Inxight, a text analytics vendor, last year, and there have been other similar acquisitions and output efforts by IBM and Information Builders. These show there is interest in moving into this market space, but I look forward with some anticipation to seeing what they do. The main pure-play vendors making some noise here are Attensity, SPSS, ClearForest, Clarabridge and Factiva.

Unstructured data is making its way into the business world in leaps and bounds with the increase of wikis being used for projects; blogs being used for thought leadership comment and informal open idea spreading; document management systems for document streamlining and, in certain industries, legal compliance (think of Sarbanes-Oxley as an example); e-mail that we use every day; files that we share every day. It is far more prevalent than we realise. Merrill Lynch estimates that more than 85% of all potentially usable business information originates in unstructured form [1].

The question that immediately comes to mind when I consider that percentage is: how do we analyse it and make sense of what we have? Unstructured data is precisely that, unstructured. It doesn't come "prepacked" as traditional data does, with existing tables and the apparent context that those tables provide. In fact, unstructured data poses an interesting modelling challenge: the transaction aspect of it represents the dimensions (or context, if you prefer) of the event, while the event itself (which would be the fact, if you want to compare with standard dimensional modelling technique) is encased within an informal structure. Take the update of a document after it has been uploaded into a document management system as an example. The transaction in the database encapsulates the metadata around the moment: time of upload, person who uploaded the document, name of document, size changes and date. It does not tell you what changed, and that is the fundamental difference. An immense amount of business value lies in that difference. It is untapped and waiting for our exploration.
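To make the document example concrete, here is a minimal sketch in Python. It assumes we have kept both versions of a document as plain text; the `UploadRecord` fields, the `changed_lines` helper and the sample text are all hypothetical and are not taken from any particular document management system. The record holds what the transaction knows, while the comparison recovers what actually changed.

```python
import difflib
from dataclasses import dataclass
from datetime import datetime


@dataclass
class UploadRecord:
    """Hypothetical transaction metadata: the context of the event only."""
    document_name: str
    uploaded_by: str
    uploaded_at: datetime
    size_bytes: int


def changed_lines(old_text, new_text):
    """Return (added, removed) lines between two versions of a document."""
    added, removed = [], []
    for line in difflib.ndiff(old_text.splitlines(), new_text.splitlines()):
        if line.startswith("+ "):
            added.append(line[2:])
        elif line.startswith("- "):
            removed.append(line[2:])
    return added, removed


# Illustrative sample data (assumed, not from a real system).
old_version = "Delivery within 5 days.\nRefunds on request."
new_version = "Delivery within 10 days.\nRefunds on request."

record = UploadRecord(
    document_name="service_terms.txt",
    uploaded_by="operator01",
    uploaded_at=datetime(2008, 11, 4, 9, 30),
    size_bytes=len(new_version.encode("utf-8")),
)

added, removed = changed_lines(old_version, new_version)
print(record)            # what the transaction tells us
print(added, removed)    # what the content tells us
```

The size in the record moves by a single byte; only the comparison of the content shows that the delivery promise was changed.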

Another example that illustrates this point rather clearly is that of comments made in an incidents manual by call centre operators when they record what happened on their shift when dealing with a client. There is an existing process (as for most issue tracking systems) that covers states such as resolved, unresolved, fixed, closed and others, but it doesn't really cover what actually happens at times. Sometimes circumstances are such that the failure of delivery is a combination of factors that cannot be represented by the limited options an operational system has been built specifically to offer. It isn't an option in a list box or drop-down. So, it gets recorded in a comment box.

That comment box or change in a document might not be easily accessible through conventional means when it comes to analysing its content, but text analytics allows us a view into what exists in an unstructured state. In essence it is a variation on data mining, where the focus is on semantics as opposed to number association and significance. There are levels of maturity in implementing this. A starting point for an organisation would be having enterprise-level content-based search for all its documents. Consider this for a moment: how much time do you spend looking for a document?

The next step is adding context to that search. This is done by adding metadata to the search. That could be the location, the author of the document, the year in which you remember it was put together or the name of the project it was done for. Following this, the creation of a taxonomy and a classification effort further streamline the discovery process, with audits on documents and a formal document management system to ensure (and at times enforce) a "one source of truth" document, as opposed to having several copies lying everywhere.
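As a rough illustration of the step from plain content search to search with context, the sketch below filters a content match by optional metadata. The `Document` fields, the `search` function and the sample documents are assumptions made for the example; a real enterprise search product would expose its own interface.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Hypothetical document with a few metadata fields and its text."""
    title: str
    author: str
    year: int
    project: str
    body: str


def search(documents, term, **metadata):
    """Content search, optionally narrowed by exact-match metadata filters."""
    term = term.lower()
    results = []
    for doc in documents:
        if term not in doc.body.lower():
            continue  # no content match
        if all(getattr(doc, field) == value for field, value in metadata.items()):
            results.append(doc)
    return results


# Illustrative sample data (assumed).
documents = [
    Document("Churn review", "Thandi", 2007, "Retention", "Customer churn rose in Q3."),
    Document("Churn follow-up", "Pieter", 2008, "Retention", "Churn drivers by region."),
    Document("BI roadmap", "Thandi", 2008, "Strategy", "Planned BI investments."),
]

content_only = search(documents, "churn")
with_context = search(documents, "churn", year=2008, project="Retention")
print([d.title for d in content_only])
print([d.title for d in with_context])
```

The content-only search returns every document that mentions the term; adding the year and project you happen to remember narrows the result to the one you were looking for.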

This is where you can start talking about discovery systems, which do funky things like generate metadata from documents and classify the documents "automagically". It is also where you find the provision of platforms for content applications, which allow you to isolate the application functionality from the data so that you can manage change within the application layer without affecting the data; data integration that allows access to repositories and application-specific formats; and integration with enterprise-level applications.

Text analytics provides a window through which we are able to gain deeper insight into the choices and actions of a customer. Where traditional systems are bound by the structure and logic of their initial design, in that they display precisely and only what is requested (a report is only a report), text analytics takes the perspective that we don't know right now what we are looking for, but we seek the means to search for it. It enables a further step in delving into the why and how of something, as opposed to being left with only the when and the what.

1. Structure, Models and Meaning: Is "unstructured" data merely unmodelled?, Intelligent Enterprise, March 1, 2005.

By Martin Rennhackkamp, COO of PBT