
Shining light on dark data

Sound data management processes and effective search technology are the keys to unlocking and benefiting from dark data.

By Jessie Rudd, Technical business analyst at PBT Group
Johannesburg, 16 Jan 2015

It is estimated that approximately 2.5 quintillion (2.5 x 10^18) bytes of data are created every day - meaning roughly 90% of the world's data today has been created in the last two years alone. This explosion in the volume, variety and velocity of data is fast becoming a huge headache for many businesses. [1]

By the very nature of the digital world of today, data must now be collated and collected from various sources - including the massive influx from digital sources like social networks - accurately and fast.

However, in AIIM's 'State of the ECM Industry' report, it was revealed that nearly half of the firms surveyed (47%) are finding managing electronic office documents a significant challenge, and about three-quarters of modern business communication channels, such as instant messages, text messages, blogs and wikis, are stated as being uncontrolled and off the corporate radar. [2]

Drowning in data

That is staggering. So much is being said and written about 'the age of big data' and its huge benefits to business, when in fact most companies are struggling to just keep their heads above water.

Add to this mix of structured and unstructured data the data stored in tools or programs that have since become obsolete, and you have the beginnings of a huge dark data problem.

Dark data refers to all the little nitty-gritty bits of completely unanalysed data - like server log files, customer call detail and mobile geo-location - that sit in a silo, warehouse, document storage facility, or spreadmart, which everyone is keeping 'just in case'. Much of it is dusty and has not been looked at for years. It is simply filed away in its respective 'bites' as part of a compliance and security protocol, and some of it comes from legacy systems.

So, in a world where about 50% of corporates are struggling with data collection and compliance, is there any scope for dark data analysis?

It would seem to me that for any data to be valuable, a question must first be asked. A business question, a need for a report, a solution necessary. Without that very basic starting point, everything is just dark data. Dark data becomes light when it is needed to fulfil a purpose. At its heart, dark data is just a subset of big data.

So, how does this relate to the very many companies that are still too immature in their content management, search, and basic reporting to contemplate big data projects, or the conversion of dark data to light data?

Back to basics

Making good technology decisions today, with a view to a big data future, is a step in the right direction. CIOs, no matter the size of the company, need to bring their big data projects back to BI basics. Concentrate on content management, enterprise search and conventional BI capability. In this way, companies could, in the future, realise real benefits from a well-designed and easily accessible big data/dark data repository.

In other words, if sound data management processes are implemented and the company has good search technology, it has made huge inroads into unlocking its dark data so it can make good use of it.

Take a look at a medical claim as an example. What if a patient makes a complaint about a pending claim? Much of the information about the claim has probably been broken down into searchable database records. Not everything, though.

Think of all the communication between a medical aid and its customers or brokers: Word documents and PDF files, notes from doctors, and other content associated with their interactions, along with any internal e-mails.

For a medical aid to get a better sense of how it responded, or, perhaps more importantly, failed to respond, it would make sense to search for this patient-related information in the file system. By using the appropriate keywords associated with a patient's name, account number, e-mail addresses, etc, vast amounts of detail-rich data suddenly become available.
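As a rough illustration of that kind of keyword search, here is a minimal Python sketch. It assumes the documents have already been exported to plain text (Word and PDF files would first need a text-extraction step, which is not shown), and the function name `find_mentions` and the sample keywords are hypothetical, not part of any particular product.

```python
import os

def find_mentions(root, keywords):
    """Walk a folder tree and report which keywords appear in each text file.

    Returns a dict mapping file path -> list of keywords found in that file.
    Matching is a simple case-insensitive substring test (an assumption made
    for illustration; real enterprise search would use an index).
    """
    hits = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # Ignore undecodable bytes so one odd file does not stop the scan.
                with open(path, encoding="utf-8", errors="ignore") as fh:
                    text = fh.read().lower()
            except OSError:
                continue  # unreadable file: skip it
            matched = [k for k in keywords if k.lower() in text]
            if matched:
                hits[path] = matched
    return hits
```

Calling `find_mentions("/path/to/claims", ["Jane Doe", "ACC-1001"])` would list every file that mentions the (made-up) patient name or account number, which is the starting point for seeing how the medical aid actually responded.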

Moreover, what if the medical aid wanted even more context? What if it could search for patients with a similar complaint and then correlate all the results?

Such a wealth of valuable and detail-rich information may not show up through conventional methods.
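To make the idea of correlating similar complaints concrete, the sketch below groups patients by the complaint terms that appear in their correspondence. The record layout (a patient ID paired with free text), the function name `group_by_complaint` and the sample terms are all assumptions for illustration only.

```python
from collections import defaultdict

def group_by_complaint(records, complaint_terms):
    """Group patient IDs by the complaint terms found in their correspondence.

    records: iterable of (patient_id, text) pairs (a hypothetical layout).
    complaint_terms: phrases to look for, matched case-insensitively.
    Returns a dict mapping each matched term -> sorted list of patient IDs.
    """
    groups = defaultdict(set)
    for patient_id, text in records:
        lowered = text.lower()
        for term in complaint_terms:
            if term.lower() in lowered:
                groups[term].add(patient_id)
    return {term: sorted(ids) for term, ids in groups.items()}
```

With a handful of records, a term such as "delayed" would map to every patient whose e-mails or notes mention a delay, giving the medical aid a first view of how widespread a complaint is.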

At the end of the day, enterprise is all about knowing which question to ask, being able to find the data to answer that question, asking the question at the right time, and then having the appropriate search technology at the touch of a button.

Data that is allowed to go dark is a missed opportunity that few can afford. The most common types of underused data include social media, Web site and mobile data. This is all incredibly valuable - if found and used in time.

Imagine dedicated software tools that allow a business to confidently and quickly identify shifting brand opinion, product failures and customer dissatisfaction. Tools such as Splunk, which sit comfortably on top of the semantic layer, make machine data eminently accessible, usable and of value to everyone.

My advice? Start at the beginning and do the basics right. By being able to use the correct tools at the correct time to tap into the wealth of big data and dark data just sitting around in most companies gathering dust, a whole treasure trove of new questions to ask may be discovered.

[1] http://www.storagenewsletter.com/rubriques/market-reportsresearch/ibm-cmo-study/
[2] http://www.computerweekly.com/news/2240088950/Web-20-content-out-of-control
