Zero latency means keeping data current

Data availability and quality are critical components of the zero latency organisation.
By Charl Barnard, GM of business intelligence at Knowledge Integration Dynamics
Johannesburg, 01 Sept 2005

In this, the first of a two-part series on optimal extraction, transformation and loading (ETL), I examine the options for keeping data warehouses constantly refreshed without hindering the performance of transactional systems.

The success of data warehousing initiatives has taught senior management to expect instant access to accurate and timely information to assist them in their decision-making. This reliance on information and analytics has placed unforeseen demands on data warehouses: user expectations soar, and users demand to see current data added to the warehouse as soon as possible.

Management has no idea that the data it expects to have available at the touch of a button is drawn from multiple sources and multiple formats, at various times, before undergoing an ETL process to get it into the data warehouse - nor should it care. A good data warehouse simply makes information available, hiding the complexity involved in ensuring the system is populated with accurate data.

Getting data into the warehouse to start with is a complex task due to the source, format and accuracy issues described above, but there are tools available to assist corporates in accomplishing this task. Companies can choose standard ETL tools or design their own, depending on budgets and in-house resource availability.
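As a rough illustration of the pattern a hand-rolled pipeline follows, the sketch below walks through a single extract-transform-load step in Python. The table names, columns and use of SQLite are hypothetical stand-ins; a commercial ETL tool wraps the same steps in a managed, restartable environment.

```python
# Minimal hand-rolled ETL sketch (illustrative only).
# Table and column names are hypothetical; a real pipeline would read from
# production sources and handle errors, logging and restarts.
import sqlite3

def extract(source_conn):
    """Pull raw order rows from the (hypothetical) transactional source."""
    cur = source_conn.execute("SELECT order_id, customer, amount FROM orders")
    return cur.fetchall()

def transform(rows):
    """Apply simple cleansing: drop incomplete rows, normalise names."""
    return [
        (order_id, customer.strip().upper(), float(amount))
        for order_id, customer, amount in rows
        if customer and amount is not None
    ]

def load(warehouse_conn, rows):
    """Insert the cleansed rows into the warehouse fact table."""
    warehouse_conn.executemany(
        "INSERT INTO fact_orders (order_id, customer, amount) VALUES (?, ?, ?)",
        rows,
    )
    warehouse_conn.commit()
```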

Problems arise, however, when trying to ensure the latest data is always available in the warehouse. Running a full ETL process every night is not the answer. Data warehouse administrators need to ensure they have a methodology for getting new or updated data into the data warehouse as quickly, smoothly and non-disruptively as possible. In the ideal zero latency organisation, data from transactional systems will be added to the data warehouse as soon as it is processed.
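Short of true zero latency, a common halfway house is an incremental refresh driven by a high-water mark: only rows changed since the previous run are pulled across. The sketch below assumes the source table carries a last_updated timestamp column, which many legacy systems do not, and the table and column names are again hypothetical.

```python
# Incremental refresh sketch using a high-water mark (illustrative only).
# Assumes the source table has a last_updated column, which many legacy
# systems do not - in which case CDC-style techniques are needed instead.
import sqlite3

def refresh_since(source_conn, warehouse_conn, high_water_mark):
    """Copy only rows changed since the previous refresh."""
    cur = source_conn.execute(
        "SELECT order_id, customer, amount, last_updated "
        "FROM orders WHERE last_updated > ?",
        (high_water_mark,),
    )
    rows = cur.fetchall()
    warehouse_conn.executemany(
        "INSERT OR REPLACE INTO fact_orders (order_id, customer, amount, last_updated) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    warehouse_conn.commit()
    # The latest timestamp seen becomes the next run's high-water mark.
    return max((r[3] for r in rows), default=high_water_mark)
```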

The changing data landscape

To keep information in the warehouse current, a data refresh policy is required, and implementing one can be an expensive exercise. Three traditional refresh methods are:

* Extracting data from legacy databases into flat files is a traditional approach that is no longer suitable. Flat files are cumbersome and prone to error, and the whole process takes time and is a drag on the company's infrastructure.

* Another option is to use database triggers to catch each event that modifies data and record the changes. While this is more elegant than flat files, it introduces triggers, capture tables, database links and custom code that must be created and constantly maintained.

* Then there's the merge/purge option, where all the information is extracted from the sources, verified, and the redundant information deleted after comparison with the target system. This may be the most accurate of the three options, but it is also the most complex, cumbersome and time-consuming of all - and let's not even think about resource consumption.
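To see why merge/purge is so heavy, consider the comparison at its core. The minimal sketch below assumes both data sets fit in memory as key-to-row dictionaries, which at warehouse scale they rarely do; every record from source and target must be read and compared on every run.

```python
# Merge/purge comparison sketch (illustrative only).
# Both data sets are assumed to fit in memory as {key: row} dictionaries,
# which is exactly why the approach becomes cumbersome at warehouse scale.
def merge_purge(source_rows, target_rows):
    """Return the inserts, updates and deletes needed to align target with source."""
    inserts = {k: v for k, v in source_rows.items() if k not in target_rows}
    updates = {
        k: v for k, v in source_rows.items()
        if k in target_rows and target_rows[k] != v
    }
    deletes = [k for k in target_rows if k not in source_rows]
    return inserts, updates, deletes
```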

Change data capture

A better approach to refreshing data is to catch each record as it is being updated and then add it to the target or an intermediary database. This approach avoids having to read legacy databases, write and maintain new objects and code, or keep transactional databases busy with analytical tasks.

There are two basic techniques used to catch data as an update is occurring: data replication and change data capture (CDC).

* Data replication: To implement this method, the data to be captured has to be identified before the update starts and then trapped as it occurs. Replication gives administrators more control over which data should be caught and which let through. There is also less work to do once the data has been captured, as its content and structure are known. However, additional I/O is incurred in the replication procedure and, because the data warehouse keeps changing, constant attention has to be paid to the parameters and triggers managing the trapping process.

* Change data capture: CDC is the second approach to efficient data refreshment. One CDC methodology captures all changes to a log tape throughout the working day, from which they can be identified and later written to the target system.

The second methodology is to extract the changed data directly from the database buffers as the change happens. Here the change is immediately reflected in the data warehouse, obviating the need for a log tape.
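As a rough sketch of the log-based flavour of CDC, the snippet below replays captured change records against a target structure. The change-record format and the apply_changes helper are hypothetical; real CDC products read the database's own log or buffers rather than a simple Python list.

```python
# Log-based CDC apply sketch (illustrative only).
# The change-record format is hypothetical; real CDC tools read the
# database's transaction log or buffers rather than a simple list.
def apply_changes(change_log, target):
    """Replay captured changes against a {key: row} target structure."""
    for change in change_log:
        op, key, row = change["op"], change["key"], change.get("row")
        if op in ("insert", "update"):
            target[key] = row          # upsert the changed row
        elif op == "delete":
            target.pop(key, None)      # remove the deleted row
    return target

# Example: two changes captured during the day, applied after the fact.
log = [
    {"op": "insert", "key": 101, "row": {"customer": "ACME", "amount": 250.0}},
    {"op": "delete", "key": 57},
]
warehouse = apply_changes(log, {57: {"customer": "OLD", "amount": 10.0}})
```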

Whichever approach is selected, CDC is the method of choice for zero latency organisations because it enables real-time (or as close to real-time as necessary) updates to the data warehouse.

In the second Industry Insight in this series, I will take a closer look at the approaches to CDC, the methodologies, challenges and some technology enablers.

* Charl Barnard is product manager at Knowledge Integration Dynamics.
