The principles of continuous learning in DevOps

An organisation can translate the improvements of one DevOps team into a catalyst of change for the entire company by using mistakes as a springboard for learning.
By Jonty Sidney, Senior cloud and DevOps engineer at Synthesis
Johannesburg, 18 Nov 2020

If DevOps is a philosophy that aims to change an enterprise’s entire approach to software, it cannot be relegated to just the deployment of software.

Just as Toyota’s Lean manufacturing process has revolutionised not only manufacturing plants but entire enterprises, so too can DevOps transform how software teams behave and dramatically increase their ability to deliver high-quality tools for their customers.

The authors of The DevOps Handbook felt the true strength of DevOps lies in how it can inspire teams to build learning into their everyday activities and share their knowledge with the broader employee base, pulling the entire technology space (and eventually the entire organisation) to a higher level of productivity and efficiency.

In my previous columns, I discussed the first two ‘ways’ of DevOps: how teams can increase efficiency with smaller batches (in other words, continuous integration and continuous deployment) and how they can create effective feedback loops that allow them to detect issues sooner.

Once these two milestones have been reached, there is one last piece of the puzzle that needs to be identified: How does an organisation translate the improvements of one team into a catalyst of change for the entire organisation?

In many organisations, mistakes and errors are viewed as an evil – things that should be avoided at all costs. When taken to the extreme, this creates an almost pathological hatred of mistakes and those who make them. The outcome is an organisation where employees hide the mistakes they make through excuses, blame-shifting and outright denial.

Additionally, in an attempt to avoid any mistakes, management refuses to listen to new ideas or suggestions on how to improve. “How could we change?” they wonder. To them, the risks are clearly too high to try something different. Unfortunately, the reality is the opposite! It is almost a cliché at this point, but the risk of not changing and experimenting is too high.

This is not to imply that mistakes are a good thing, or that management should not critically analyse suggestions for change. Organisations will always need to strike a balance in how they innovate, and to create a culture where mistakes (and their impacts) are viewed in the correct proportion.

How can this balance be achieved? How can an organisation truly walk this tightrope?

The answer – as simple as it may sound – is to spread the knowledge. This is where the ‘third way’ of DevOps enters the enterprise. It should be noted the majority of practices that fall under the third way seem to be common sense. However, as can often be seen, common sense is the least common of all the senses.

The third way emphasises a collection of management philosophies – it would be inaccurate to call them practices – that can be used to help create this environment. Every team and business will need to find how they can be implemented, but they are fundamental to creating a highly efficient DevOps-focused company.

The first of these philosophies is that of organisational learning and safety cultures. In many companies, the question “who caused this outage?” is a warning sign – someone is possibly going to be fired. When the result of owning up to a mistake is so extreme, why would anyone ever admit to one?

In a mature DevOps culture, management wants to know what happened so that they can spread the information throughout the company. They use mistakes as a springboard for learning. The importance of this change in perspective cannot be overstated.

The most important effect of this change is what is known as ‘multiplied’ learning. When mistakes happen – and they will – the causes, fixes and results of those mistakes are spread across the company. This is done formally through demos, presentations and the like.

These sessions will often be presented by the person who made the mistake, as they are in the best position to explain how it occurred and how they fixed it. However, learnings are also spread through the company informally: when employees are not afraid to speak up about mistakes, they will admit to them and tell people about them. This is especially powerful when teams are not permanent assignments. When the person who made the mistake is moved to another team, they will feel much safer in admitting what they did in previous teams and projects, which helps prevent the same mistake from occurring again.

The second of these philosophies is the institutionalisation of improvement. Often the word “workaround” becomes synonymous with “the feature is completed”. How often do developers come across comments in a codebase saying “We will fix this bug when we get the time” that were written years earlier? This is anathema to a mature DevOps team. Fixing workarounds, paying down technical debt, improving environments and tweaking code are part of the daily routine and rituals of a DevOps team.

The second way describes how teams bring quality closer to the source, making people responsible for fixing problems and adding to the overall quality of a system. The third way takes this a step further. The third way makes it company policy to always be looking for things to fix. This is not just the responsibility of a technical employee. This is the responsibility of the entire organisation – to look for ways of improving and optimising how the organisation operates.

Connected to these two philosophies is a slightly different understanding of a core component of Dr Eliyahu Goldratt’s Theory of Constraints, in which a massive amount of emphasis is placed on the overall efficiency of a system.

It is not adequate to focus on local optimisations of specific processes. One must improve the global efficiency of the system. When applied to an organisation’s method of learning from mistakes, this principle can be restated: it is not sufficient for a single local group of people to learn from a mistake or event – local discoveries must be transformed into global improvements.
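
To make the local-versus-global point concrete, here is a minimal sketch in Python. It assumes a simple serial delivery pipeline whose throughput is capped by its slowest stage. The stage names and daily capacities are invented purely for illustration, and a real organisation is of course far messier than three numbers.

```python
# Hypothetical pipeline: stage name -> work items that stage can finish per day.
# All names and numbers are assumptions made up for this illustration.

def system_throughput(stage_capacity):
    """A serial pipeline can deliver no faster than its slowest stage."""
    return min(stage_capacity.values())


def bottleneck(stage_capacity):
    """Return the name of the stage that constrains the whole system."""
    return min(stage_capacity, key=stage_capacity.get)


baseline = {"develop": 10, "test": 4, "deploy": 8}

# Local optimisation: speed up a stage that is not the constraint.
local_fix = dict(baseline, develop=20)

# Global optimisation: relieve the constraint itself.
global_fix = dict(baseline, test=8)

print("bottleneck:", bottleneck(baseline))
print("baseline throughput:", system_throughput(baseline))
print("after local fix:", system_throughput(local_fix))
print("after global fix:", system_throughput(global_fix))
```

In this toy model, doubling the capacity of the development stage leaves overall throughput unchanged, while relieving the testing constraint doubles it. The same reasoning applies to learning: a lesson that stays inside one team is a local optimisation.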

Finally, the third way discusses an organisation’s approach to resilience. How does an organisation respond to completely unforeseen disasters or situations? The answer is: they make it part of their daily life! The best way to describe this mindset is by example: Netflix. Most people know Netflix as the primary competitor of mankind’s natural sleeping patterns. However, mature DevOps teams know that Netflix is the gold standard for resilience.

Several years ago, Netflix released its “Simian Army” to the open source community. This is a set of tools it uses to simulate various forms of faults, outages and failures in its Amazon Web Services environments.

Along with the tools, it gave multiple examples of how it is constantly running them in its production environments to ensure the system can respond and react to any kind of disaster. While this is not the appropriate forum to discuss these tools, suffice it to say that what Netflix has accomplished is truly remarkable.

With the “Simian Army”, Netflix – and any other organisation that can make use of these tools – can build simulations and scenarios that test a system’s (and a team’s) resilience to its maximum. How do teams know how their system will react to unforeseen circumstances? Simple: they create unforeseen circumstances and observe. Bring an experimental philosophy to the organisation. Do not be afraid to see what happens!
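
The "create the failure and observe" idea can be sketched in a few lines of Python. This is a toy model of my own and not the Simian Army's actual code or API: the `Service` class, the instance names and the `min_healthy` threshold are all hypothetical, and nothing here touches a real environment.

```python
import random


class Service:
    """Toy stand-in for a service running on several instances (hypothetical)."""

    def __init__(self, instance_names, min_healthy):
        self.healthy = set(instance_names)
        self.min_healthy = min_healthy  # instances needed to keep serving traffic

    def terminate(self, name):
        """Simulate the sudden loss of one instance."""
        self.healthy.discard(name)

    def is_available(self):
        return len(self.healthy) >= self.min_healthy


def run_experiment(service, rng):
    """Terminate one randomly chosen healthy instance and record what happened."""
    victim = rng.choice(sorted(service.healthy))
    service.terminate(victim)
    return {"terminated": victim, "still_available": service.is_available()}


rng = random.Random(42)  # seeded so the experiment can be repeated
service = Service(["web-1", "web-2", "web-3"], min_healthy=2)

first = run_experiment(service, rng)   # one instance lost
second = run_experiment(service, rng)  # a second instance lost

print(first)
print(second)
```

In this model the service survives losing one instance but not two. Discovering that limit in a controlled experiment is far better than discovering it during a real outage.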

A word of warning, though. Some mistakes are serious, and some systems do not respond well to being shut down at random (a neat trick the Simian Army can do). This is not a call to remove all consequences from our organisations. However, it is critical that a culture shift be made – away from blame, shame and overreactions, and towards a more level-headed, more rational and reasonable management style. Every organisation needs to find its own sweet spot for this.

As with DevOps as a whole, the third way is a call to action. It was never envisioned as a silver bullet or a “one-size-fits-all” methodology. It is a challenge to management, developers, business analysts, testers, quality assurance teams, security and compliance teams and more – to improve the development process and improve the finished products and services that are offered to end-users.

Most importantly, it is a challenge to improve the lives of the people that software development affects, from the junior developer starting their career to the battle-hardened senior who has been working for decades. This is not a quick solution. But it is a highly rewarding one!