Subscribe
  • Home
  • /
  • Storage
  • /
  • Thinking zero latency? Think changed data capture

Thinking zero latency? Think changed data capture

Capturing data as it is updated is the most efficient means of keeping data warehouses up to date.
By Charl Barnard, GM of business intelligence at Knowledge Integration Dynamics
Johannesburg, 21 Sept 2005

In this, the final part in a series on optimal extraction, transformation and loading (ETL), I examine the various options for capturing data in real-time.

Transactional systems in most corporations are constantly busy as records are updated, deleted and changed in the normal course of a business day. For a growing number of executives, being able to use data warehouses and analytical applications, knowing the latest business information is included in their analyses, is critical.

Updating a data warehouse monthly, and in some cases even weekly, is therefore no longer acceptable. As defined in the first Industry Insight in this series, it is possible to keep analytical systems updated in real-time without paralysing the performance of transactional processing through changed data capture (CDC) techniques.

One method of CDC is using a log tape to capture and identify changes that occurred in the course of the business day. Of course, reading a tape can saddle IT with additional problems normally associated with this technology, including the annoying habit of different database vendors not to structure their logs in a standard manner. There are also benefits to log tapes in terms of I/O efficiency.

A second method of CDC is to take the changed data out of the company`s database buffers as the change occurs. The change would be reflected immediately, negating the need to read and interpret a log tape at a later time.

There is a performance penalty here due to the need to install software able to notice the changes and pass them on, but this approach will enable the company to handle large amounts of data processing at a high speed.

CDC options

Updating a data warehouse monthly, and in some cases even weekly, is no longer acceptable.

Charl Barnard, GM

There are five recognised capture mechanisms that can be used with various databases. The appropriateness of each depends on the organisation and infrastructure concerned.

* Timestamps: Modern operating systems and databases can enable companies to mark record creation and updates in nanosecond ranges. Making use of these facilities offers a powerful mechanism for capturing changes to databases.

* Database log: Technology advances in databases also allow one to use the change log to capture a history of all changes to the database after a certain date and time.

* Comparison of databases: Using a simple compare utility, an image or backup of the data warehouse`s database made at the last analytical update can be compared to a more current image. The changes can then be extracted and stored in a staging area until they are ready to be integrated into the data warehouse.

* Create snapshots: Instead of harassing the operational system, administrators can take a snapshot of the database and use this in the CDC process.

* Use database triggers: If the operational system is a relational database, and most are, a trigger can be created to save inserted, updated or deleted records. This doubles the workload on the system as whenever a record is changed the database saves the changes and also updates another table to be used for CDC purposes.

Technology enablers

Most analytical applications and database solutions have internal methods for the smooth capture of changed operational records. Whichever option is selected, make sure it allows users to access a variety of options for CDC functionality. It must offer source-specific capture agents and make use of published APIs, log files, journal files and other non-invasive resources to capture changed data.

Although technology is more powerful today than ever before, ensuring an analytics system allows easy access to changed data without increasing demands on system and staff resources is vital to guarantee the smooth operation of business.

Of course, the efficacy of any technology is uniquely dependent on the organisation`s infrastructure and legacy systems may not be the easiest from which to extract data. The reality, however, is that when taking all economics and operational efficiencies into account, CDC will always emerge as the best mechanism to refresh the modern data warehouse.

Share