The term data engineer has gained significant popularity in recent years. In the business intelligence / data warehouse (DWH) industry, we used to speak of ETL developers. Today, however, we are seeing a great deal more focus on this new role: the data engineer.
What exactly is this role, and is it merely a rebranding of the original ETL developer role?
The fun stuff
Understanding cloud and big data engineering technologies is of course critical for all data engineers.
These are important enablers for many modern-day use cases, which include message queuing, stream processing, API integration and distributed data processing.
Message queues are an important data ingestion capability. They are gaining popularity in the DWH world because they enable real-time application integration.
This often comes at a fraction of the cost of traditional batch extracts, and can have little to no impact on the performance of the underlying source system, which can be a real game-changer for both operational systems and DWH environments.
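The decoupling that makes this cheap can be sketched with an in-memory queue standing in for a durable broker such as Kafka or RabbitMQ (the event fields here are hypothetical): the source system publishes and moves on, and the warehouse loader drains events at its own pace.

```python
import queue

# In-memory stand-in for a message broker; in production this would be
# a durable, external system such as Kafka or RabbitMQ.
broker = queue.Queue()

def publish(event: dict) -> None:
    """The operational system emits an event without waiting on the DWH."""
    broker.put(event)

def consume_batch(max_items: int) -> list:
    """The DWH loader drains whatever has accumulated, at its own pace."""
    batch = []
    while len(batch) < max_items and not broker.empty():
        batch.append(broker.get())
    return batch

# The source system publishes as transactions occur...
for order_id in (101, 102, 103):
    publish({"order_id": order_id, "status": "created"})

# ...and the warehouse ingests them asynchronously, in bulk.
batch = consume_batch(max_items=10)
```

Because the producer never blocks on the consumer, the operational system is insulated from warehouse load windows and outages.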
Stream processing means processing data as it is created or received. It can include complex transformation and aggregation logic on these streams, and it is a critical enabler of real-time analytics and event-driven architectures.
So, it’s no surprise that stream processing frameworks are growing in popularity, especially given that stream processing enables use cases such as fraud detection, sentiment analysis and log analysis. There are many other use cases to choose from here as well, but in all these use cases, knowledge of stream processing is a key enabler.
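A toy version of the fraud-detection use case shows the essential idea: events are evaluated one at a time as they arrive, rather than collected into a batch first. The threshold and event fields are illustrative, and a generator stands in for an unbounded stream from a framework such as Kafka Streams or Flink.

```python
def transaction_stream():
    # Stand-in for an unbounded stream of transactions.
    yield {"account": "A", "amount": 50.0}
    yield {"account": "B", "amount": 9_500.0}
    yield {"account": "A", "amount": 120.0}

def flag_suspicious(stream, threshold=1_000.0):
    """Evaluate each event as it arrives and emit alerts immediately,
    instead of waiting for a nightly batch to land."""
    for event in stream:
        if event["amount"] > threshold:
            yield event

alerts = list(flag_suspicious(transaction_stream()))
```

The same shape, per-event logic applied to an unbounded input, underpins sentiment analysis and log analysis pipelines as well.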
Many systems these days expose their data through APIs. This is particularly true for cloud-based systems.
As such, data ingestion routines that can integrate with APIs are becoming commonplace. Cloud-based tools and platforms expose their functionality primarily through APIs, which is also true for many third-party open source tools and languages.
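Most of these APIs are paginated, so an ingestion routine typically loops until the service reports no more data. The sketch below uses a stubbed `fetch_page` in place of a real HTTP call (the endpoint shape, `results` and `has_more` fields are assumptions for illustration); real code would issue the request with a library such as `requests`.

```python
def fetch_page(page: int) -> dict:
    """Stand-in for an HTTP GET against a hypothetical paginated REST
    endpoint, e.g. GET /api/customers?page=N."""
    data = {1: ["alice", "bob"], 2: ["carol"]}
    return {"results": data.get(page, []), "has_more": page < 2}

def ingest_all() -> list:
    """Walk the pages until the API signals there is no more data."""
    page, records = 1, []
    while True:
        body = fetch_page(page)
        records.extend(body["results"])
        if not body["has_more"]:
            break
        page += 1
    return records

records = ingest_all()
```

Production versions add the concerns the loop hides: authentication, rate limiting, retries and checkpointing of the last page ingested.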
What is missing?
We have discussed some of the ‘newer’ concepts currently being used, and while they are not so new anymore, there are still gaps that need to be filled.
Many of these topics deal with specific use cases, and often with the acquisition of data. But what about the transformation, storage and loading of data in distributed environments? This matters all the more because distributed storage and processing are the workhorses of cloud and big data platforms.
It is therefore an absolute must for data engineers to be fluent in parallel, distributed toolsets and frameworks. This can be as simple as Hive SQL and Pig, or as involved as programming in Python, Java or Scala on the Spark platform.
In all cases, understanding how data is distributed and processed within these platforms is a critical skill to have.
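The core mental model behind all of these platforms is the same map/shuffle/reduce pattern. A minimal single-machine sketch, with each input line standing in for a data partition processed by a separate executor, looks like this:

```python
from collections import Counter
from functools import reduce

lines = ["spark hive pig", "spark spark hive"]

# Map: each partition (here, each line) is counted independently,
# much as an executor would process its own slice of data on a cluster.
partials = [Counter(line.split()) for line in lines]

# Reduce: the partial results are merged into a final answer,
# analogous to the shuffle-and-aggregate step.
totals = reduce(lambda a, b: a + b, partials)
```

Knowing which operations stay local to a partition and which force a shuffle across the network is exactly the "how data is distributed and processed" skill referred to above.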
Remember the first principles
One of my favourite mantras is to never forget your origins: your first principles. In this case, we must not forget the original skills that we learnt as ETL developers.
We are still going to have to build operational data stores, data vaults or data warehouses. While the platform might have changed, many of the solution requirements still exist.
As a result, it doesn’t matter what new technology is used, or if it is in the cloud or based on big data – we still need the core skills that made this possible.
So ask yourself:
- How do we build type-one and type-two slowly changing dimension loaders?
- How do we take a relational model and add the paradigm of time to it?
- How do we integrate data from many, disparate source systems?
- How do we build robust ETL pipelines that can handle incoming data quality issues?
- How do we handle ETL restartability?
- How do we identify and manage our deltas?
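Take the type-two dimension loader from that list as an example. A minimal sketch, assuming a dimension tracking a single `city` attribute with `start_date`/`end_date` validity columns (all names hypothetical), looks like this: when a tracked attribute changes, expire the current row and insert a new open-ended one.

```python
from datetime import date

def apply_scd2(dimension: list, incoming: dict, today: date) -> list:
    """Minimal type-two loader: expire the current row when a tracked
    attribute changes, then insert a new current row."""
    current = next(
        (r for r in dimension
         if r["key"] == incoming["key"] and r["end_date"] is None),
        None,
    )
    if current and current["city"] == incoming["city"]:
        return dimension  # no change: nothing to do
    if current:
        current["end_date"] = today  # close off the old version
    dimension.append({
        "key": incoming["key"],
        "city": incoming["city"],
        "start_date": today,
        "end_date": None,  # open-ended end date marks the current row
    })
    return dimension

dim = [{"key": 1, "city": "London",
        "start_date": date(2020, 1, 1), "end_date": None}]
dim = apply_scd2(dim, {"key": 1, "city": "Paris"}, date(2024, 6, 1))
```

The full history of the key is preserved, which is precisely the "paradigm of time" the questions above refer to.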
Now extend the first principles
Traditional ETL skills are still a must-have; however, extending this to work in cloud and big data platforms is the next stage of ETL evolution.
So, the question is: how do we manage slowly changing dimension updates on a platform that does not fundamentally support in-place updates? How do we optimise our row-based processing? How do we optimise our high-volume processing? And how do we take advantage of the new technology to simplify complex logic such as restartability and delta management?
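One common answer on append-only storage (immutable files on HDFS or object stores, for instance) is to never update in place at all: every change lands as a new row, and the "current" state is resolved at read time by taking the latest version per key. A minimal sketch, with a hypothetical `loaded_at` ordering column:

```python
records = [
    {"key": 1, "city": "London", "loaded_at": 1},
    {"key": 2, "city": "Berlin", "loaded_at": 1},
    {"key": 1, "city": "Paris",  "loaded_at": 2},  # an "update" is just a new row
]

def current_view(rows: list) -> dict:
    """Resolve the latest version of each key at read time,
    rather than mutating stored data."""
    latest = {}
    for row in sorted(rows, key=lambda r: r["loaded_at"]):
        latest[row["key"]] = row  # later versions overwrite earlier ones
    return latest

view = current_view(records)
```

This latest-record-wins pattern also simplifies restartability: re-appending a failed batch is harmless, because the read-time resolution deduplicates it.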
Competence in the core ETL skills is inviolable, as it is the foundation of a good data engineer. While all the other skills are critical, they are complementary and often use-case-specific.
As a result, my view is that companies should not look to grow, or recruit, based on these new skills alone, but rather make sure they always have the core skills required first, and extend (or search for) these skills based on the relevant use case.
Above all else, avoid looking for heroes. Remember: a jack of all trades is a master of none.