
Instrumentation in BI

Real-time statistics on the extraction, transformation and loading process.

By Cor Winckler, technical director at PBT Group.
Johannesburg, 05 Aug 2011

In a previous Industry Insight, I elaborated on the importance of keeping technical metadata around the extraction, transformation and loading (ETL) process in order to track what has been executed, when, for how long, and how many rows were inserted, updated or deleted. It highlighted the aspects that assist with reporting on what has technically taken place in the ETL environment.

Considering this, I felt it necessary to provide a more technical overview of this concept, which is crucial in the overall process of business intelligence (BI).

The industry term for this is 'instrumentation', which refers to the code that is inserted into tool-based or hand-coded ETL. This allows the user to monitor the progress of tasks, and in addition, some of the outputs of this instrumentation enable BI on top of BI, which was addressed in the previous Industry Insight. As a bare minimum, inputs are required such as a job name, start time, end time, and the number of records inserted, updated or deleted - however, instrumentation does not end there.
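
To make that bare minimum concrete, it could be captured in a small audit table along these lines (a sketch only; the table and column names here are hypothetical):

CREATE TABLE etl_job_log (
  job_name      VARCHAR2(100) NOT NULL,  -- which ETL task ran
  start_time    DATE          NOT NULL,  -- when it startedted
  end_time      DATE,                    -- NULL while still running
  rows_inserted NUMBER,
  rows_updated  NUMBER,
  rows_deleted  NUMBER
);

Each task writes a row at start-up and completes it on finishing; it is this history that later makes 'BI on top of BI' possible.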

Task time

Consider a scenario that is extremely familiar to any 'techie' who has ever run a long-running ETL task. The project manager walks in, looks over the techie's shoulder and asks: “How far is that task?”

The techie could answer: “It started three hours ago, at exactly 11:30am,” as the techie wrote the start time into the metadata. However, that does not answer the question! What is required is an indication of how much longer the task is going to take to run. Of course, the techie could look at a previous run, which took four hours, so in essence it could be another hour. However, what if the last time it ran, it only took one hour? Questions such as: 'How far is it?'; 'How many records has it already processed?'; 'Is it a multi-step process?' and 'How many steps have already been completed?' come to the fore.

There are very good reasons for having this information on hand. Firstly, if it is a single large transaction, and it has already inserted millions of records, and by chance it stops at this point - it may take longer than three hours to roll back. If it is five minutes away from finishing, the techie would rather let it finish. On the other hand, if it has only inserted five rows, then a rollback would be quicker, and the task could be restarted once the underlying problem is fixed.
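
On Oracle, one rough way to gauge that rollback cost (a sketch; it assumes the load runs in a single open transaction) is to look at how much undo the transaction is holding in V$TRANSACTION:

-- Undo held by current transactions: a rough proxy for rollback cost.
SELECT s.sid,
       s.username,
       t.used_ublk AS undo_blocks,
       t.used_urec AS undo_records
FROM   v$session s
       JOIN v$transaction t ON t.addr = s.taddr;

Millions of undo records suggest an expensive rollback; a handful suggest the task can safely be stopped and restarted.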

In most databases, the underlying database management system (DBMS) allows the user to view more information about currently running processes. In Oracle, for example, a number of system catalogues can be queried that allow the user to see active sessions as well as their resource utilisation. These default catalogues help in understanding what is happening to the queries, and database administrators use them to analyse and monitor performance in general.
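
For example, a quick query such as the following (a sketch; the exact columns available vary by Oracle version) shows which sessions are active and what they are doing:

-- Active user sessions and the work they report.
SELECT sid, serial#, username, status, module, action, client_info
FROM   v$session
WHERE  status = 'ACTIVE'
AND    username IS NOT NULL;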

Info on the move

Additionally, it is possible to 'expose' extra run-time information in these Oracle catalogues specifically. This means that information about the process can be accessed as it is running, as opposed to only being able to report on it afterwards.


The developer of ETL code can therefore set the following fields in V$SESSION: MODULE, ACTION and CLIENT_INFO. This is done by calling the relevant procedures in the DBMS_APPLICATION_INFO package. If this is done consistently and diligently by all developers, then users can get an immediate glimpse of what is running on the system with a simple query from V$SESSION, or by using the session browser in a tool like SQL Developer or TOAD.
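
A minimal sketch of how an ETL step might tag itself follows; the calls are the documented DBMS_APPLICATION_INFO procedures, but the job and step names are hypothetical:

BEGIN
  -- Tag this session so it is identifiable in V$SESSION.
  DBMS_APPLICATION_INFO.SET_MODULE(
    module_name => 'LOAD_SALES',
    action_name => 'Insert into FCT_SALES');
  DBMS_APPLICATION_INFO.SET_CLIENT_INFO('Nightly batch, run 42');

  -- ... the actual ETL work goes here ...

  -- Clear the tags once the step completes.
  DBMS_APPLICATION_INFO.SET_MODULE(NULL, NULL);
  DBMS_APPLICATION_INFO.SET_CLIENT_INFO(NULL);
END;
/

A simple SELECT of MODULE, ACTION and CLIENT_INFO from V$SESSION then shows at a glance which job, step and run each session belongs to.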

Another very handy piece of Oracle instrumentation available to developers is the SET_SESSION_LONGOPS procedure in the same package. This makes it possible to maintain a user-defined long operation, which will appear in V$SESSION_LONGOPS and show a simple status of a given task, for example (a code sketch follows the list):

1. Insert_1: FCT_SALES: 45032 of 567000 rows done
2. File: ARCHIVE: 15 of 78 files done
3. Step: PROCESS: 3 of 9 steps done
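
A sketch of how the first status line might be produced from PL/SQL (the operation name, batch size and row counts are purely illustrative):

DECLARE
  l_rindex BINARY_INTEGER := DBMS_APPLICATION_INFO.SET_SESSION_LONGOPS_NOHINT;
  l_slno   BINARY_INTEGER;
BEGIN
  FOR i IN 1 .. 567 LOOP
    -- ... insert the next batch of 1,000 rows into FCT_SALES here ...

    -- Publish progress; it becomes visible in V$SESSION_LONGOPS immediately.
    DBMS_APPLICATION_INFO.SET_SESSION_LONGOPS(
      rindex      => l_rindex,
      slno        => l_slno,
      op_name     => 'Insert_1',
      target_desc => 'FCT_SALES',
      sofar       => i * 1000,
      totalwork   => 567000,
      units       => 'rows');
  END LOOP;
END;
/

Anyone with access can then follow the task, including Oracle's own estimate of the time remaining:

SELECT opname, target_desc, sofar, totalwork, units, time_remaining
FROM   v$session_longops
WHERE  sofar < totalwork;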

With the above information at the techie's fingertips, it would be much easier to answer the manager when they want to know how far a particular task is in the process.

As a matter of fact, the very next step would be to build a small dashboard that can visualise everything from run-time statistics, while tasks are running, to historical statistics, for insight into the bigger picture. However, the dashboard is another topic that I will discuss at a later stage.

In conclusion, instrumenting the extraction process, as well as all downstream operations in the ETL process, provides many benefits, including efficiency, as it allows much smaller volumes of data to be extracted in a quicker timeframe - certainly a key benefit to the overall process of BI.
