About
Subscribe

Real-time data warehousing

Johannesburg, 29 Jul 2003

Long, long ago, when data warehouses were young, hard disks were expensive, and down-stream processing was mostly analytical, data warehouses were pretty much time cycle snapshots. In the retail world, for example, MIS users were satisfied with monthly snapshots of sales by store by SKU. For most organisations, this was already a lot of detail to deal with.

Nowadays the "downstream" processes (reporting, analysis, dashboards) depend more on detailed data. For example, data mining applications score customers based on their item-per-basket buying behaviour, which is then analysed even further.

However, it has become even more time-critical. With online retailers the scoring has to be up-to date, because the whole shopping experience is customised according to recent behaviour. If I bought 10 books on astrology yesterday, and a new astrology book hits the shelves today, they want to put a banner for that book out there before I log onto the site again. Monthly scoring is not good enough. Detecting credit card fraud a month after it happened is useless from the business and clients` perspectives, since by then the credit card has already been fraudulently used to its limit. If my card is stolen, and my bank is smart enough to spot anomalies in my buying pattern immediately, they may be able to stop the fraudster almost immediately.

This means that the loop between analysis and data mining and operational activities needs to be closed much quicker. Buying behaviour analysis and fraud-detection need to be real-time or near-time.

Technology

Disk space has become cheap. Storing detail is no problem. Storing lots of history is no problem. CPU has become cheap. Mining models that used to run weeks can now be completed in five minutes, making it practical to publish the results as they become available.

What are needed are ETL (extract, transform and load) tools with the ability to do short spurts of work in a timed or event-driven fashion. For example, instead of processing a million or more changes per night, the tool needs to wait for 100 changes, but if less than 100 changes occur in three minutes, to start processing the batch anyway.

Our source systems need to have the ability to continually feed changes to the ETL tool. This can be done through replication facilities and TP Monitors, which are the predecessors of EAI (enterprise application integration) tools.

EAI tools can publish such changes as a continual flow of messages with guaranteed delivery into the ETL process. The ETL tools, in turn, must have the ability to subscribe to the EAI message queue. In future we will probably see an integration of ETL and EAI tools as these technologies converge. However, the current set of available EAI tools implies a very mature (and expensive) IT infrastructure.

On the application side, the data mining tools must have the ability to do "continual scoring" as the underlying data is being updated in near- time in the data warehouse.

Obviously, the underlying DBMSs must have the ability to maintain indexes and summary tables automatically as the data in the data warehouse changes. The DBMSs must also remain online and queryable while the underlying data is being changed. This is difficult for conventional relational DBMSs that use transaction logging and locking.

Business requirements

Real-time data warehousing requires a very mature IT infrastructure and MIS environment. It must be well justified by the business. A very important point is that the business should require real-time data warehousing, as the two examples illustrated. Great care should be taken against real-time data warehousing becoming an IT hobby horse with no direct real-time requirement from the business, and also no real-time direct benefit to the business.

There are two exceptions to this guideline, namely a very heterogeneous environment and data volumes. EAI is an appropriate systems interface solution in very heterogeneous and federated environments. It makes good sense to deploy it for ETL as well if it is already being used for other inter-system communication. Data volumes may force the organization to start doing real-time data warehousing - where data is processed and loaded as it arrives. The batch window then enlarges to 24 hours - and the load and performance impact becomes a continuous trickle-feed. This is practical for call records, shopping carts, ATM transactions, etc, where it has become impractical to wait until midnight before attempting to load all 40 million records for the day.

Challenges

One of the biggest challenges is overcoming the technology hurdles. If there is only one part somewhere on the architecture diagram that is not "real-time-ready", it will fail. The whole environment must have the ability to accept a small update from the source systems at one end and see it all the way through the data warehouse to the analysis tools at the other end, without ever losing the update.

Developers must learn the skills and techniques to streamline the loading processes as a continuous stream of small updates. There are a number of technical difficulties which have to be overcome. The overriding principle is that "the show must go on". The whole environment has to be geared to handle changes as and when they happen, to handle exceptions without "stopping the bus", to re-process old and late arriving records at any time, and to back out and reload small pieces of data.

Business must be able to react to the near-time feedback from the warehouse. This may even imply new or changed business processes.

Summary

Due to more pressing business requirements, larger data volumes, and the availability of technologies such as EAI, real-time data warehousing is becoming a reality that will see greater popularity in the near future.

However, with it comes the price tag of technology and development challenges and even changes to business processes, as it reduces the capture-to-analysis window dramatically.

Share

Editorial contacts

Martin Rennhackkamp
PBT Group
(021) 551 0937
martin@prescient.co.za