Production Incidents
Note: No AI was used in the creation of this article :-)
The cause of all incidents or failures in a working system is ultimately a change.
Let’s put our SRE hat on and see if we can come up with some classification of incidents or root causes so it can help us troubleshoot.
One way to classify failures can be by dividing them into internal or external ones.
An internal or static failure is one where the system, set up in a valid environment, is already broken.
Internal or static failures
Types or examples of internally broken systems:
- Introduction of code bugs, as in code that breaks or responds in an unexpected manner under some paths or states. Database migrations can fall into this category too.
- Application misconfiguration: an application has a bad configuration file (an incorrect file that won’t load properly, missing parameters, parameters that don’t match the setup, etc.).
- System misconfiguration: for example versioning mismatch issues.
- Permission or security issues: Linux permissions (e.g. on files), security tooling (e.g. SELinux, AppArmor), application-level permissions (e.g. Kubernetes), or authentication and authorization problems (e.g. insufficient permissions).
- Networking problems, like a service not having a network path to reach another service. Another example is having a new firewall or network filtering rule that prevents access to a service.
- Issues in fundamental networking dependencies like DNS probably deserve their own separate mention.
- Dependency issue: the application requires a dependency that is missing or broken (it could be a service like an API or a code library).
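As an illustration of catching one of these static failures before it reaches production, here is a minimal sketch of a configuration pre-flight check. It assumes the configuration is JSON and that the required parameter names are known up front; `validate_config` and the example keys are hypothetical names chosen for this example, not part of any particular tool.

```python
import json


def validate_config(raw_text, required_keys):
    """Return a list of problems; an empty list means the config looks usable."""
    try:
        config = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        # The file is broken in the "won't load properly" sense.
        return [f"config does not parse: {exc.msg} (line {exc.lineno})"]
    if not isinstance(config, dict):
        return ["config root must be an object"]
    # The file loads, but parameters may still be missing.
    return [f"missing parameter: {key}" for key in required_keys if key not in config]


if __name__ == "__main__":
    # Hypothetical parameter names, for illustration only.
    print(validate_config('{"db_host": "localhost"}', ["db_host", "db_port"]))
```

A check like this could run in CI or at application start-up, so a bad file fails fast instead of surfacing as an incident later.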
External or dynamic failures
An external or dynamic failure is one where the system in isolation works but due to changes in its environment or changes in time, it fails.
Some types of dynamic failures:
- Capacity problems: hardware resources (mostly RAM and disk/volume, but possibly CPU and bandwidth as well) are not enough for the current volume of traffic. Operating system resources like file descriptors or available ports can be exhausted too, and contention with other workloads (noisy neighbours) belongs here as well.
- Self DoS: settings or code that are there to prevent or mitigate issues but end up causing bigger problems. This is more typical in complex systems, where cascading failures can be created. A simple example of self-sabotage could be a poorly designed rate-limiting strategy (a static issue that surfaces in a dynamic environment).
- Dependency issue: the application uses an external service (like an API) that fails and the code is not prepared to fail gracefully in this case.
- Time-based: expired SSL/TLS certificates or secrets, for example. Misconfigured health checks can be included here as well.
- Hosting issues: the hosting provider having services or data centres going down, or forgetting to pay their bill.
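Time-based failures are among the easiest to catch ahead of time. Below is a minimal sketch that computes how many days remain before a certificate expires. It assumes you already have the certificate's `notAfter` string in the format returned by Python's `ssl.SSLSocket.getpeercert()`; `days_until_expiry` is a hypothetical helper name, and any alerting threshold is something you would choose for your own setup.

```python
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left before a certificate expires (negative if already expired).

    not_after: a string such as "Jan  5 09:34:43 2018 GMT".
    now: seconds since the epoch; defaults to the current time.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires_at - now) / 86400


if __name__ == "__main__":
    remaining = days_until_expiry("Jan  5 09:34:43 2018 GMT")
    print(f"{remaining:.1f} days until expiry")
```

Running something like this on a schedule and alerting when the value drops below a threshold turns a surprise outage into a routine renewal task.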
Classifications are useful up to a point and they are partially or totally subjective (see how we classify continents, and the people arguing about it on Reddit).
Of course, in DevOps/SRE we talk about root causes in the plural (the Swiss cheese model), since often there is no single clear incident cause. Even if there was technically a single failure (e.g. the disk of a database filled up), several processes failed as well (e.g. there was no monitoring or alerting).
Story time
I worked on a production incident where our Kubernetes RabbitMQ cluster failed without a clear cause. We suspected a capacity problem due to high volumes and spikes of traffic. Restarting the cluster with more nodes as an emergency mitigation helped for a bit, but then it went down again.
Upon investigation I found out that almost all of the traffic in the cluster was directed to the first RMQ node. Because the cluster ran as a StatefulSet, the k8s pods were created in order, and node-0, being the first created and the oldest, would get almost all the traffic. A setting to rebalance traffic automatically (which wasn’t a default in the official or semi-official Helm chart) fixed the issue. So this was a combination of a missed setting (a static internal issue) that didn’t become visible until an external dynamic element (the high volume of traffic) was present.
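In hindsight, a simple skew check might have pointed at the imbalance sooner. The sketch below is not what we ran during the incident. It assumes you can already collect a per-node count of connections (or messages) from your monitoring, and both `busiest_node_share` and the node names are hypothetical.

```python
def busiest_node_share(connections_per_node):
    """Return (node, share) for the node carrying the largest share of traffic.

    connections_per_node: mapping of node name to a connection (or message) count.
    Returns (None, 0.0) when there is no traffic at all.
    """
    total = sum(connections_per_node.values())
    if total == 0:
        return None, 0.0
    node = max(connections_per_node, key=connections_per_node.get)
    return node, connections_per_node[node] / total


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    sample = {"rabbitmq-0": 90, "rabbitmq-1": 6, "rabbitmq-2": 4}
    node, share = busiest_node_share(sample)
    print(f"{node} carries {share:.0%} of connections")
```

With evenly balanced traffic each of three nodes would sit near one third, so a share far above that is a hint to look at how clients are being distributed before adding more capacity.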
