Key principles of data engineering

Too much focus is given to the hype and buzzwords, often resulting in the less flashy but critical data engineering basics being forgotten.
By Julian Thomas, Principal consultant at PBT Group
Johannesburg, 13 Feb 2020

If you have ever dealt with the data specialist organisation I am associated with, you have most likely heard one of our favourite sayings, as it has become something of a mantra for us: “Remember the basics.”

This is less of a reminder, and more of a warning. Forget the basics at your peril.

In this rapidly moving world, our technology, software and programming choices are growing exponentially. We are ruled by the hype cycle, and with it comes a massive amount of jargon.

What we find as a result is that too much focus is given to the hype ‒ the next buzzword around the corner ‒ and the less flashy, but ultimately critical “basics” are often forgotten. Or worse, ignored. But, if you get the basics right, everything else will follow.

So, let’s discuss these basics in a bit more detail. After all, when it comes to data engineering, we have a set of principles that we hold dear.

The first principle is that there is no such thing as a best practice. This is quite a bold statement, and many will probably instantly object. But consider that there are always exceptions to the rule, and the larger and more diverse a company becomes, the more frequent these exceptions become.

We prefer to talk about the most appropriate fit, the most relevant practice. This implies there might be more than one acceptable way to do something. Adopting this as the first principle ensures the business retains a flexible mindset.

The second principle is to adopt, and adhere to, a consistent data engineering architecture. Flesh out what the solution architecture, standards and solution patterns look like. Socialise, communicate, train, and most importantly, enforce.

A principle or standard is useless if it is not consistently implemented. Failure to enforce standards results in a chaotic environment that is expensive to maintain, with a high risk of human and program failure, all because systems were not built and implemented consistently.

This one seems like a bit of an own goal: it is an easy problem to prevent, but massively expensive and complex to fix after the fact.
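One practical way to enforce a standard is to automate the check itself. The sketch below is purely illustrative, assuming a simple table naming convention of my own invention (the stg_/dim_/fct_ prefixes and the regular expression are not a prescription); the point is that a rule verified in a build or deployment pipeline gets enforced, while a rule left in a document does not.

```python
import re
import sys

# Assumed convention: staging, dimension and fact tables prefixed stg_/dim_/fct_.
NAMING_RULE = re.compile(r"^(stg|dim|fct)_[a-z][a-z0-9_]*$")

def check_table_names(table_names):
    """Return the names that break the assumed naming convention."""
    return [name for name in table_names if not NAMING_RULE.match(name)]

if __name__ == "__main__":
    violations = check_table_names(["stg_customer", "CustomerTemp", "fct_sales"])
    if violations:
        print(f"Naming standard violations: {violations}")
        sys.exit(1)  # fail the build, so the standard is actually enforced
```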

The third principle is to keep it simple, which reinforces the mantra of going back to basics. Avoid unnecessarily complex solutions. Do not unnecessarily extend or complicate your infrastructure, data and software landscape.

Most importantly, keep coding simple. Avoid large, complex programs. These just become maintenance nightmares. Clumping too much logic in a single program simply means a greater cost and risk when changes are required.

Keeping a simpler, modular approach to implementation results in smaller program units that are easier to code and maintain. These smaller code units are then also easier to sequence together.

The aim here should be to avoid unnecessary dependencies between coding units at execution time. The business should therefore look to keep execution trees as simple as possible, with as few dependencies outside each individual branch as possible.
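To make the modular idea concrete, here is a minimal sketch in Python. The step names and the in-memory records are assumptions for illustration only; the point is that each unit does one thing, and the pipeline is simply a flat sequence of small, independent steps.

```python
from typing import Callable, List

Step = Callable[[list], list]

def extract() -> list:
    # Small, single-purpose unit: read the raw records (stubbed with a literal).
    return [{"id": 1, "amount": "100"}, {"id": 2, "amount": "250"}]

def clean(rows: list) -> list:
    # Another small unit: convert types and nothing else.
    return [{**row, "amount": float(row["amount"])} for row in rows]

def load(rows: list) -> list:
    # Final unit: hand the cleaned rows to the target (stubbed as a print).
    print(f"Loading {len(rows)} rows")
    return rows

def run_pipeline(steps: List[Step], rows: list) -> list:
    # The execution tree stays flat: each step depends only on the previous output.
    for step in steps:
        rows = step(rows)
    return rows

if __name__ == "__main__":
    run_pipeline([clean, load], extract())
```

Each of these small units can be coded, tested and rerun on its own, which is exactly what makes the overall solution cheaper to maintain.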

The fourth principle sounds like a no-brainer, but it has significant implications for the business outlook and approach. Cater for the unexpected. When building a data engineering solution, it must be robust. The show must go on. We all know the state of the data we must work with, and it is often not great.

So, what happens when the data breaks the code? The solution must be robust enough to handle this gracefully. This implies revisiting the development approach and adopting test-driven development for a more robust solution.

However, at the same time, the business must also adopt a better testing approach ‒ do not focus solely on positive testing, but on negative testing as well. The goal should be that the only thing that can cause code to crash is a system failure. No data should crash the solution, and it should be able to handle any data anomaly it encounters with ease.
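As a rough illustration of what handling bad data gracefully can mean, the sketch below routes records that fail validation to a reject list instead of letting them crash the run, and includes a negative test that deliberately feeds in broken data. The field names and validation rules are assumptions for illustration only.

```python
def transform(records):
    """Split records into loaded and rejected instead of crashing on bad data."""
    loaded, rejected = [], []
    for record in records:
        try:
            amount = float(record["amount"])  # may raise KeyError or ValueError
            loaded.append({"id": record["id"], "amount": amount})
        except (KeyError, ValueError, TypeError) as err:
            # The show must go on: capture the anomaly and keep processing.
            rejected.append({"record": record, "error": str(err)})
    return loaded, rejected

def test_transform_survives_bad_data():
    # Negative test: deliberately feed broken records and assert the run survives.
    good, bad = transform([
        {"id": 1, "amount": "10"},
        {"id": 2, "amount": "not-a-number"},
        {"id": 3},  # missing field entirely
    ])
    assert len(good) == 1 and len(bad) == 2
```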

The fifth principle, just like Agile, is all about metrics. If metrics are not compiled, the business does not know how it has performed. Just like an Agile team compiles metrics to help measure performance and plan future work, so too should metrics be compiled on aspects like processes performed, how much data has been loaded, associated quality metrics and scores, etc. This is vital in helping to manage solutions.
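A minimal sketch of what compiling these metrics might look like, assuming a simple transform that returns loaded and rejected records; in a real solution the figures would be written to a metrics table or monitoring tool rather than printed, and the metric names here are illustrative only.

```python
import time

def run_with_metrics(records, transform):
    """Run a transform and compile basic metrics about the load."""
    started = time.time()
    loaded, rejected = transform(records)
    metrics = {
        "records_read": len(records),
        "records_loaded": len(loaded),
        "records_rejected": len(rejected),
        "quality_score": len(loaded) / len(records) if records else 1.0,
        "duration_seconds": round(time.time() - started, 3),
    }
    print(metrics)  # stand-in for writing to a metrics table or monitoring tool
    return loaded, metrics
```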

The sixth and final principle for the purpose of this article is to avoid unnecessary work. Don't manually build redundant, repetitive code. Optimise, automate and implement metadata-driven solutions.

Too many hard-coded, manual solutions are prone to human failure. Automate as much as possible, to reduce risk of failure, and allow the business more time to focus on the important stuff.
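As an illustration of the metadata-driven idea, the sketch below generates the load logic for each feed from a small metadata list instead of hand-coding a separate program per table. The metadata fields and the SQL template are assumptions, not a recommended design; in practice the metadata would sit in a control table rather than in code.

```python
# Assumed metadata: in practice this would live in a control table, not in code.
FEEDS = [
    {"source": "crm.customers", "target": "stg_customer", "load_type": "full"},
    {"source": "erp.orders", "target": "stg_orders", "load_type": "delta"},
]

def build_load_sql(feed):
    # One generic template serves every feed; the metadata supplies the differences.
    where = " WHERE updated_at >= :last_run" if feed["load_type"] == "delta" else ""
    return f"INSERT INTO {feed['target']} SELECT * FROM {feed['source']}{where}"

if __name__ == "__main__":
    for feed in FEEDS:
        print(build_load_sql(feed))  # stand-in for executing against the target
```

Adding a new feed then becomes a metadata entry rather than another hand-written program, which is where the reduction in manual, repetitive work comes from.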

This is a difficult topic to condense; however, I do hope this has given you something to consider. If not, then I am sure you will be meeting me (or someone like me) sometime soon, and we will be discussing this in much greater depth, likely under a bit more pressure.
