Test Your Infrastructure with Game Days

October 8, 2025 · Fernando Duran

This article was published in the book 97 Things Every Cloud Engineer Should Know by O’Reilly.

What is a Game Day

“You don’t have a backup until you have performed a restore” is a good aphorism, and in a similar way we can say that your service or infrastructure is not fully resilient until you have tried breaking it and recovering from the failure.

A Game Day is a planned rehearsal exercise where a team tries to recover from an incident. It tests your readiness and reliability in the face of an emergency in a production environment.

The motivation is for the teams and the code to be ready when incidents occur; therefore you want the test incident to resemble a real-life incident as closely as possible.

When you run these experiments in production environments and in an automated way, this is called Chaos Engineering.

Planning for Game Day

There are several items to consider when preparing for a Game Day. Most importantly, you need to decide whether you are running the exercise in production. Ideally you want production, since staging or test environments are never quite the same as the real thing. On the other hand, you have to comply with your SLAs, obtain approval and warn customers if needed.

If you have never done a Game Day or if the target system has never been tested for disruption, then start with a test environment.

Another decision is whether you want the procedure to be planned and triggered by an adversarial “red team”; in this case a team or person very familiar with the system creates the failures without warning.

You want to run these Game Days periodically (every four to six months, for example), and also after a new service or piece of infrastructure has been added. The frequency may also depend on the recent history of responses to real incidents.

An incident exercise should run for a few hours at most; you don’t want long-lived lingering effects.

Different types of failure can be introduced at different layers:

  • Server resources: for example, high CPU or memory usage.
  • Application: for example, processes being killed.
  • Network: unreliable networking or traffic degradation, such as added latency, packet loss, blocked communication, or DNS failures.

Gray failures (degradation of services) are often worse than complete crashes, since the latter have a short feedback loop. Also, degradation can be hard to produce.
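
As a rough illustration, here is a minimal Python sketch of how such faults could be injected on a single Linux host, shelling out to stress-ng for CPU and memory pressure and to tc/netem for network degradation. The tool choices, the eth0 interface name and the durations are assumptions for this example rather than a prescribed setup; try it on a disposable host first and keep the blast radius small.

```python
#!/usr/bin/env python3
"""Minimal fault-injection sketch for a Game Day (illustrative only).

Assumes stress-ng and iproute2 (tc) are installed, the target network
interface is eth0, and the script runs with root privileges.
"""
import subprocess
import time

IFACE = "eth0"  # assumed interface name; change for your hosts


def cpu_and_memory_pressure(seconds: int = 120) -> None:
    # Run 4 CPU workers and one VM worker dirtying ~1 GiB of memory.
    subprocess.run(
        ["stress-ng", "--cpu", "4", "--vm", "1", "--vm-bytes", "1G",
         "--timeout", f"{seconds}s"],
        check=True,
    )


def degrade_network(delay_ms: int = 200, loss_pct: int = 5) -> None:
    # Add latency and packet loss on the egress of IFACE (a gray failure).
    subprocess.run(
        ["tc", "qdisc", "add", "dev", IFACE, "root", "netem",
         "delay", f"{delay_ms}ms", "loss", f"{loss_pct}%"],
        check=True,
    )


def restore_network() -> None:
    # Remove the netem qdisc to undo the degradation.
    subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root"], check=True)


if __name__ == "__main__":
    degrade_network()
    try:
        cpu_and_memory_pressure(seconds=120)
        time.sleep(60)  # leave the degradation in place a little longer
    finally:
        restore_network()  # always clean up, even if the injection fails
```

Whatever tooling you use, make the cleanup path (here restore_network) at least as reliable as the injection itself, so the exercise does not leave long-lived lingering effects.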

Before running a Game Day you need to determine the following (a minimal plan sketch follows the list):

  • The failure scenario or scenarios.
  • The scope of systems affected and what can go wrong, keeping the “blast radius” contained.
  • The “condition of victory” or acceptance criteria for the system to be “fixed”.
  • The time window for recovery; estimate the duration and multiply it by two or three just in case.
  • Date and time.
  • Whether the responders are warned beforehand or not.
  • The team or people on call at the time who will handle the incident.
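
To keep these decisions explicit and reviewable, one option is to capture them as a small, versioned plan. The sketch below is only an illustrative template: the field names and example values are made up for this article, not a standard format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


# Illustrative Game Day plan; field names are assumptions, not a standard.
@dataclass
class GameDayPlan:
    scenario: str                   # the failure being injected
    blast_radius: list[str]         # systems that may be affected
    victory_condition: str          # acceptance criteria for "fixed"
    estimated_recovery: timedelta   # your best estimate...
    time_window: timedelta          # ...padded by 2x or 3x just in case
    start: datetime                 # date and time of the exercise
    announced: bool                 # is the on-call team warned beforehand?
    responders: list[str] = field(default_factory=list)


plan = GameDayPlan(
    scenario="Primary database becomes read-only",
    blast_radius=["checkout-service", "orders-db-replica"],
    victory_condition="Checkout p99 latency under 500 ms for 15 minutes",
    estimated_recovery=timedelta(hours=1),
    time_window=timedelta(hours=3),
    start=datetime(2025, 11, 3, 14, 0),
    announced=True,
    responders=["on-call SRE", "payments team"],
)
```

Keeping such a plan in version control next to the runbooks also makes the post-mortem comparison of planned versus actual recovery straightforward.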

You also need to prepare the response team(s) and how they are going to work. A common approach to incident management is to have one person focused on solving the problem, supported by other people or teams so that she doesn’t have to worry about anything else. For communications, a chat channel is usually better than email or phone since it works in real time, allows multiple people to collaborate and leaves a written log.

The incident manager (the person leading the response team) doesn’t have to be a manager or the person most familiar with the system; for knowledge dissemination it is actually better if it is a different person. Besides, you want to make sure you don’t have a “bus factor” of one. If there are runbooks to recover from the incident, this is also a way to test that documentation, by having someone other than the author go through it.

You may also want to have a coordinator to report to business units and executives; you don’t want them asking for updates and distracting the incident manager.

Other teams can be observers; a Game Day should be a learning opportunity for everyone.

After Game Day

Perform a post-mortem to answer questions like:

  • What did we do and how can we do better?
  • Did the monitoring tools alert correctly in the first place, and were those alerts routed by the pager system to the person or teams on call?
  • Did the incident team have enough information from the monitoring, logging and metrics systems?
  • Did the incident team make use of documentation like playbooks and checklists?
  • Did the members of the incident team collaborate well?

If needed, update the technical documentation and procedures, and disseminate the lessons learned.