This post is about how we handle failures and service degradation on our platform, and how we use them to learn and improve it for our customers.
At Staffbase, we run the live environments used by millions of users ourselves. It’s the DevOps approach we follow: You build it! You run it! On top of that, we have high demands on the stability and availability of the platform. Yet the moment a service on that platform degrades or even crashes, we might not immediately jump in to fix it, e.g. by restarting the service.
What? Wait! A service crashed and you might do nothing? Ah, the service is restarted automatically, right?
Automatic restarts do happen, sure. But if we just restart the service, that adds no value for us or for our customers, because the failure might simply happen again and again. If the service returns to normal on its own soon, there is no good reason to restart it at all. If the unhealthy state persists, we dive in and investigate: Why did this just happen?
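To make that concrete, here is a minimal sketch, not our actual tooling: a Python watchdog with a made-up health endpoint and invented thresholds that tolerates a transient blip but escalates to a human investigation, instead of blindly restarting, once the unhealthy state persists.

```python
import time

import requests  # assumed dependency; endpoint and thresholds below are made up

HEALTH_URL = "https://example.internal/service/health"  # placeholder endpoint
CHECK_INTERVAL_S = 10     # how often we probe the service
GRACE_PERIOD_S = 120      # how long we tolerate an unhealthy state


def is_healthy() -> bool:
    try:
        return requests.get(HEALTH_URL, timeout=2).status_code == 200
    except requests.RequestException:
        return False


def watch_service() -> None:
    unhealthy_since = None
    while True:
        if is_healthy():
            unhealthy_since = None  # a transient blip resolved itself, nothing to do
        elif unhealthy_since is None:
            unhealthy_since = time.monotonic()  # start the clock instead of restarting
        elif time.monotonic() - unhealthy_since > GRACE_PERIOD_S:
            # Persistent degradation: hand over to a human to investigate the cause
            # instead of blindly restarting the service.
            print(f"escalate: unhealthy for more than {GRACE_PERIOD_S} seconds")
            return
        time.sleep(CHECK_INTERVAL_S)
```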
Really? Why?
We first try to understand why a service failed. Did it fail for technical or for functional reasons? Was the failure caused by a user or by a data issue? What is the effect of the failure? Did it, for example, corrupt data? What can we learn from our logs and traces?
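What that digging looks like varies, but as a rough sketch, one might group error-level log entries by trace ID to see whether the failures cluster around a particular service, user, or request path. The structured log fields used here (level, trace_id, service, message) are assumptions for the example.

```python
import json
from collections import Counter, defaultdict


def summarize_errors(log_lines):
    """Group error-level log entries by trace ID and count the services involved.

    Assumes structured JSON logs with hypothetical fields:
    level, trace_id, service, message.
    """
    errors_by_trace = defaultdict(list)
    services = Counter()
    for line in log_lines:
        entry = json.loads(line)
        if entry.get("level") == "error":
            errors_by_trace[entry.get("trace_id")].append(entry.get("message"))
            services[entry.get("service")] += 1
    return errors_by_trace, services


# Usage sketch: which services show up most often in failing traces?
# errors_by_trace, services = summarize_errors(open("errors.jsonl"))
# print(services.most_common(5))
```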
One goal of this investigation is to improve our platform. Another is to find the right actions to resolve the issue, since “trial and error” is hardly a good strategy when it comes to keeping the Staffbase platform stable for millions of users.
Of course, some root causes of failures are outside our sphere of influence: just imagine a general network outage. But there is more: sometimes there is a non-fatal failure you simply can’t pin down.
Imagine you provide your customers with a cloud-based platform that consists of hundreds of service instances interacting with each other. The functionality of each and every service has been tested thoroughly. All good so far.
Still, there are probably thousands of influencing factors that can affect the reliability of the Staffbase platform, and their impact is something we as an organization can’t test holistically in a cloud-based environment. These factors are not isolated from each other, so their combination might introduce failures on the platform.
OK. What failures exactly?
Here are some common examples: a slight performance degradation here, a load peak there. You can’t point to a specific line of code and say: “I’ve found the source of the failure.”
This is a gray failure.
Is this really that bad?
It is, because these small issues add up and can make a cloud-based platform unusable.
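A toy model makes that “adding up” tangible. The numbers below are invented: every single hop in a request’s path is only slightly degraded and still looks healthy on its own, yet the end-to-end latency a user sees grows far beyond what any individual hop suggests.

```python
import random

# Toy model: a request passes through a chain of services. Each hop on its own is
# only slightly degraded, but the effects add up end to end. All numbers are invented.
BASE_LATENCY_MS = 20
DEGRADATION_MS = 15      # the "slight performance degradation here"
LOAD_PEAK_FACTOR = 3     # the "load peak there"
CHAIN_LENGTH = 8         # number of services a single request touches


def end_to_end_latency_ms(degraded_hops: int, load_peak: bool) -> float:
    total = 0.0
    for hop in range(CHAIN_LENGTH):
        latency = BASE_LATENCY_MS + random.uniform(0, 5)
        if hop < degraded_hops:
            latency += DEGRADATION_MS   # each degraded hop still looks "mostly fine"
        if load_peak:
            latency *= LOAD_PEAK_FACTOR
        total += latency
    return total


print("healthy chain: %.0f ms" % end_to_end_latency_ms(0, False))
print("gray failure:  %.0f ms" % end_to_end_latency_ms(5, True))
```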
OK, got it. This might get really bad. But what about that Chesterton’s fence you mentioned in the title?
Chesterton’s fence is a parable by Gilbert Keith Chesterton about second-order thinking:
A fence is erected across a road. The more modern type of reformer goes gaily up to it and says, “I don’t see the use of this; let us clear it away.” To which the more intelligent type of reformer will do well to answer: “If you don’t see the use of it, I certainly won’t let you clear it away. Go away and think. Then, when you can come back and tell me that you do see the use of it, I may allow you to destroy it.”
The key is the mindset of the second reformer, who asks the first to understand the cause of an issue before acting. And it’s easily applicable in software engineering.
When one service suffers from performance degradation, you could add more instances to compensate for the slowdown (1). This might really help. But if the source of the low performance lies in another service, adding more instances of the slow service might break your platform completely once that other service can’t handle any requests anymore (2).
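A small capacity model with invented numbers illustrates the difference between the two scenarios: in (1) extra instances increase throughput, while in (2) they only push more load onto an already saturated downstream dependency, whose useful throughput then collapses under retries and timeouts. This is a sketch of the reasoning, not a model of any real Staffbase service.

```python
# Toy capacity model for scenarios (1) and (2); all numbers are invented.
INCOMING_RPS = 900                # load hitting the "slow" service
CAPACITY_PER_INSTANCE_RPS = 200   # what one instance of that service can handle
DOWNSTREAM_CAPACITY_RPS = 500     # what its downstream dependency can handle


def downstream_throughput(offered_rps: float) -> float:
    """Crude congestion model: beyond its capacity, retries and timeouts eat into
    the useful work the downstream dependency can still do."""
    if offered_rps <= DOWNSTREAM_CAPACITY_RPS:
        return offered_rps
    overload = offered_rps - DOWNSTREAM_CAPACITY_RPS
    return max(0.0, DOWNSTREAM_CAPACITY_RPS - 0.8 * overload)


def handled_rps(instances: int, bottleneck_downstream: bool) -> float:
    accepted = min(INCOMING_RPS, instances * CAPACITY_PER_INSTANCE_RPS)
    if not bottleneck_downstream:
        return accepted                      # scenario (1): scaling out helps
    return downstream_throughput(accepted)   # scenario (2): scaling out backfires


for n in (3, 5, 8):
    print(f"{n} instances | (1) local bottleneck: {handled_rps(n, False):4.0f} rps"
          f" | (2) downstream bottleneck: {handled_rps(n, True):4.0f} rps")
```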
The difference between (1) and (2) is knowledge of the real source of the issue. So, with this in mind, what we do at Staffbase is dig into the details.
In the end we might discover a query that is called often and returns its results slowly. Didn’t we test this query thoroughly before? Yes, we did. But the slow responses originate from a rather unusual combination of query parameters we didn’t have in mind originally. So adding more service instances or restarting some of them would be mere actionism and might make things worse at the same time.
Instead, we might need a code change for the query or additional indexes on the resource to solve the issue.
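As a made-up example of what such a fix could look like, assuming the resource is a MongoDB collection (the connection string, database, collection, and field names are all invented for illustration), the remedy is a compound index that covers the unusual parameter combination rather than more instances.

```python
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
posts = client["example_db"]["posts"]              # invented database and collection

# The slow query filtered on an unusual combination of parameters (say, archived
# channels within a date range), which forced a full collection scan. A compound
# index covering exactly that combination turns it into a cheap index scan.
posts.create_index(
    [("channelId", ASCENDING), ("archived", ASCENDING), ("publishedAt", ASCENDING)],
    name="channelId_archived_publishedAt_idx",
)
```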
Finding the root cause of sporadic or random failures is how Staffbase tackles so-called gray failures on its platform. The insights we gain from this approach are far more valuable for us and our customers than trying to solve an issue through more or less random actions. Applying Chesterton’s fence comes at a cost when you first introduce the approach in an organization, but it pays off quickly. It will work for your organization as well.