
Understanding the complexities of ETL for data management

Data is critical to the success of any business, but it’s in the ETL tools where a company can unlock the true potential of its data.
By Minette Lubbe, Data engineer, PBT Group.
Johannesburg, 14 Jul 2022

Data plays a critical role in the success of a business. But it is in the extract, transform and load (ETL) tools where a company can unlock the true potential of its data.

These solutions are used to harness data from multiple sources and reshape it to deliver value, while maintaining the integrity of the data.

Of course, integrating an ETL tool into an existing environment is not without its challenges. This makes identifying the right solution for the organisation essential if the business is to avoid fragmented metadata and a lack of compliance. Choosing an effective solution also helps to ensure lower maintenance and training costs.

Getting started

Planning is a key component of any ETL implementation. However, the business must first understand its own environment to gauge how well an ETL tool can interact with the source and target systems, if it is to minimise the impact on processing time and resources.

Questions to ask include: what is the intended result, how should the project be structured, and which procedures or tasks must be completed before the data enters the ETL process? Analysing the data is therefore a crucial step in this regard.

This requires the company to invest time in understanding the data and the business requirements, and in putting a data model in place that provides a clear view of the steps essential to the process.

Furthermore, data issues must be fixed at the source before the company attempts to import the data. Getting the source to comply with quality standards will result in a far more manageable ETL system.

Operational challenges

The organisation should also limit the amount of data being processed to a subset. In this way it can check the data every step of the way and identify the best ways of handling the process. Drawing a run graph or diagram of the path the development needs to follow can help in this regard.
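As a rough illustration, the development subset can be as simple as a reproducible sample of the source extract. In the Python sketch below, the file name and sample size are hypothetical:

```python
# Minimal sketch: develop against a small, repeatable subset of the data.
# The file name and sample size here are hypothetical; adjust to your environment.
import pandas as pd

def load_dev_subset(csv_path: str, n_rows: int = 10_000, seed: int = 42) -> pd.DataFrame:
    """Read the full extract, then keep a reproducible sample for development."""
    df = pd.read_csv(csv_path)
    # A fixed random_state means every development run sees the same rows,
    # so checks at each step are comparable between runs.
    return df.sample(n=min(n_rows, len(df)), random_state=seed)
```

Because the sample is fixed, results can be compared step by step while the run graph is still being refined.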

Data can then be integrated incrementally. Most tables are too large to reload in full, so using only the data applicable to a specific period will not only reduce processing time, but will also avoid straining resources. Companies should also consider scheduling, automating and monitoring ETL jobs.
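As a sketch of the incremental approach, assume a sales table with a modified_at timestamp column; all names here are illustrative and the watermark handling is deliberately simplified:

```python
# Hedged sketch of incremental extraction: only rows changed since the last
# successful load are pulled. Table, column and storage details are assumptions.
import sqlite3
from datetime import datetime

def extract_increment(conn: sqlite3.Connection, last_loaded: datetime):
    """Fetch only the rows modified after the previous watermark."""
    cur = conn.execute(
        "SELECT id, amount, modified_at FROM sales WHERE modified_at > ?",
        (last_loaded.isoformat(),),  # assumes timestamps are stored as ISO text
    )
    return cur.fetchall()

# After a successful load, persist max(modified_at) as the new watermark so the
# next scheduled run starts where this one finished.
```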

These auditing and monitoring steps are important for identifying failures. They are also the only way the organisation can tell whether an ETL process is failing to perform or execute as it should.
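One way to sketch such an audit step, assuming a simple etl_audit table exists to hold run records (the schema and names are hypothetical):

```python
# Sketch: every run writes an audit record, so failures are visible, not silent.
import sqlite3
from datetime import datetime, timezone

def run_with_audit(conn: sqlite3.Connection, job_name: str, job) -> None:
    started = datetime.now(timezone.utc).isoformat()
    status, detail = "FAILED", "unknown"
    try:
        rows = job()  # the actual extract/transform/load step
        status, detail = "SUCCESS", f"{rows} rows processed"
    except Exception as exc:
        detail = str(exc)  # record the failure before re-raising
        raise
    finally:
        conn.execute(
            "INSERT INTO etl_audit (job_name, started_at, status, detail) "
            "VALUES (?, ?, ?, ?)",
            (job_name, started, status, detail),
        )
        conn.commit()
```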

Throughout everything, quality assurance must be applied at each stage of the process. The final consideration is scalability: the company must ensure the ETL process is developed in such a way that it can scale as the business grows and its needs change.
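By way of illustration, a quality gate between stages might look something like the sketch below; the specific checks are examples rather than a complete QA suite:

```python
# Sketch of a quality gate between ETL stages: the pipeline halts when a basic
# expectation fails, so bad data never reaches the next stage.
import pandas as pd

def quality_gate(df: pd.DataFrame, stage: str, key_column: str) -> pd.DataFrame:
    if df.empty:
        raise ValueError(f"{stage}: no rows received")
    if df[key_column].isna().any():
        raise ValueError(f"{stage}: null keys found in '{key_column}'")
    if df[key_column].duplicated().any():
        raise ValueError(f"{stage}: duplicate keys found in '{key_column}'")
    return df  # pass the data through unchanged when all checks succeed
```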

Following best practices

One of the key best practices in the process is understanding the data requirements of the company. As such, data analysts must gather all the requirements into a defined set of outcomes before the ETL process is developed.

Data cleaning and master data management are also vital. The source of the data must adhere to governance requirements, and the data must fit into the organisation's master data management. This means the data must represent a true view of the business.

Mapping the source is therefore an essential part of the process. In most cases, a business works with several sources and database connections. It is here that it must define the data types to use, as well as the naming conventions.
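As a rough sketch, such a mapping can be captured explicitly, with one naming convention and one declared type per field; the field names below are hypothetical:

```python
# Illustrative source-to-target mapping: each source field gets a target name
# (following a single naming convention) and an expected type.
import pandas as pd

SOURCE_TO_TARGET = {
    "CustID":   {"target": "customer_id",   "dtype": "int64"},
    "CustName": {"target": "customer_name", "dtype": "string"},
    "Region":   {"target": "region_code",   "dtype": "string"},
}

def apply_mapping(df: pd.DataFrame) -> pd.DataFrame:
    """Rename and cast columns according to the agreed mapping."""
    renames = {src: spec["target"] for src, spec in SOURCE_TO_TARGET.items()}
    dtypes = {spec["target"]: spec["dtype"] for spec in SOURCE_TO_TARGET.values()}
    return df.rename(columns=renames).astype(dtypes)
```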

Organisations need to establish the dependencies in their ETL system: lookup tables must be populated before the dimensions are loaded, and the dimensions must be up to date before certain fact tables can be processed.
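A minimal sketch of enforcing that order, with placeholder table names, could look like this:

```python
# Sketch: load lookups first, then dimensions, then facts. A layer only starts
# once the previous one has finished. All table names are placeholders.
LOAD_ORDER = [
    ("lookups",    ["currency_lookup", "country_lookup"]),
    ("dimensions", ["dim_customer", "dim_product"]),
    ("facts",      ["fact_sales"]),
]

def run_pipeline(loaders: dict) -> None:
    """loaders maps each table name to a function that populates that table."""
    for layer, tables in LOAD_ORDER:
        for table in tables:
            loaders[table]()
        print(f"layer '{layer}' complete")
```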

Additionally, the growth of data must be anticipated to help ensure the process can keep pace when volumes increase. A robust ETL solution must be able to complete its runs within an acceptable period.

Another best practice to consider is that of alerts. These must be based on failure rather than success; if not, the process will simply generate noise that is ignored most of the time. An ETL environment that can restart the process at any step, should something go wrong, is also important.
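One way to sketch both ideas together is a file-based checkpoint with a placeholder alert function; in practice the alert would be wired to an email, chat or paging channel:

```python
# Hedged sketch: completed steps are checkpointed, so a rerun resumes at the
# failed step instead of starting over, and alerts fire only on failure.
import json
from pathlib import Path

CHECKPOINT = Path("etl_checkpoint.json")

def send_alert(message: str) -> None:
    print(message)  # stand-in for a real email/chat/pager integration

def run_resumable(steps) -> None:
    """steps is a list of (name, callable) pairs, run in order."""
    done = set(json.loads(CHECKPOINT.read_text())) if CHECKPOINT.exists() else set()
    for name, step in steps:
        if name in done:
            continue  # completed on a previous run; skip on restart
        try:
            step()
        except Exception as exc:
            send_alert(f"ETL step '{name}' failed: {exc}")  # failure only, no success noise
            raise
        done.add(name)
        CHECKPOINT.write_text(json.dumps(sorted(done)))
    CHECKPOINT.unlink()  # clean run: clear the checkpoint for the next cycle
```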

Even though most ETL developers dislike documentation, it is a good idea to put the non-functional requirements in writing and to keep a documented inventory of jobs. This documentation should also show which workflows run at what time, and how to recover from failure.
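Even a simple structured inventory, kept in version control alongside the ETL code, goes a long way; the jobs, schedules and recovery notes below are purely illustrative:

```python
# Illustrative job inventory: schedules, dependencies and recovery steps kept as
# structured data rather than tribal knowledge. All entries are hypothetical.
JOB_INVENTORY = {
    "load_dim_customer": {
        "schedule": "daily 02:00",
        "depends_on": ["currency_lookup"],
        "recovery": "safe to rerun; truncate and reload",
    },
    "load_fact_sales": {
        "schedule": "daily 03:00",
        "depends_on": ["load_dim_customer"],
        "recovery": "resume from the last checkpoint",
    },
}
```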

Implementing ETL tools takes a significant amount of work. However, if the appropriate steps are not taken, an organisation cannot get the maximum benefit from its data.
