Make sweet ETL sounds with SQL

By Mervyn Mooi for Knowledge Integration Dynamics

Johannesburg, 17 Feb 2010

Not all extraction, transformation and loading (ETL) projects perform equally, even though they may use the same tools, interact with the same databases and work with similar data.

The tools are good, even though they are designed to work with a multitude of different databases and other technologies, and therefore cannot perform brilliantly with all of them. The tools offer a visual development environment and deliver predefined processes that save the developers from manually coding or writing processes from scratch, which saves much time.

So why do some ETL exercises fail to perform, fail to reach the nirvana of that rough benchmark of a thousand lines read or 300 inserted every second? Because reality is the ugly cousin of the ideal. You can run a test on a small sample and everything seems fine, but add a million or few lines of data and the whole process slows to a crawl.

There are a host of reasons linked to terms that only the propeller heads will understand. Things like partition tables and indices that should be kept a similar size and which, if they contain null values, will bottleneck performance. Triggers should be disabled in the target database. Indexes should be disabled in tables or partitions before being loaded. Another possible reason is the use of staging tables.

Write your expressions in SQL, however, and the problem can seem to just magically disappear. It's a trick of the trade that the pocket protector clan uses to get more bang for their buck. However, ironically enough for the information community, not everyone is aware of this and other such little gems of knowledge.

SQL can be a very handy ETL developer's friend indeed. Smart application of SQL, the language of databases, can often simplify an ETL job. ETL developers who know about the full systems development life cycle and have a robust knowledge of databases, ETL, and the queries that will be run against the database for the front-end reports, can eradicate major performance issues and complexity.

SQL gives developers the ability to visualise data structures and how they will change, update or generally behave given certain data manipulation language (DML) and data definition language (DDL) commands, so the ETL programmer will have a better appreciation of the ETL input requirements, end results, and possible anomalies that may arise.

So, if SQL is so brilliant, why don't we do away with the ETL tools and just use SQL? The truth is that it can be a useful ally in the battle for better performance, it can be the timely cavalry charge that sways the outcome, but seldom is it the opening rush, the protracted melee, and the final, crushing chase from the field all rolled into one. It is not designed to perform the drudgery of ETL projects, the automation that kicks the legwork out the door. It is the pi'ece de r'esistance. It is the rousing chorus in a heavy metal ballad, the heavy fall of drums in a march, the exquisite cacophony of brass in a jazz beat, the Lady Gaga in a pop tune.

ETL tools automate what SQL demands be done manually. They are easy to use because they offer a visualised or graphical representation of the job. They hide the processing complexity from programmers, which means they don't have to script logic, which can be laborious and can lead to other problems if the developers aren't properly trained. Furthermore, ETL tools better support metadata management.

Without SQL, the entire ETL symphony risks slumping into muzak mediocrity.

Editorial contacts