Don't SSH into Production
This article was published in the book 97 Things Every Cloud Engineer Should Know by O’Reilly.
(I’m very aware of the apparent contradiction of publishing this in SadServers, a service based - so far - on working with an “SSH” terminal).
Routine server system administration tasks should be handled with automation and services; in other words, through code and software.
Not logging into system consoles for manual routine maintenance can be seen as an indicator of capability maturity (see Joel Spolsky’s “The Joel Test” for an excellent set of good software engineering practices).
With SSH user logins into critical servers, there needs to be an audit process in place to determine who had access to each server and what was done there, including hard cases like SSH forwarding or tunneling. This production access audit can get complex if, as a matter of policy, tasks are frequently performed on servers via SSH.
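As an illustration of what such an audit can build on, here is a minimal sketch that extracts SSH login events from a server’s authentication log so they can be aggregated elsewhere. It assumes a Debian/Ubuntu-style /var/log/auth.log and the standard sshd message format; both vary with the distribution and syslog/journald configuration.

```python
# Sketch: extracting SSH login events from a server's auth log for audit
# aggregation. Assumes a Debian/Ubuntu-style /var/log/auth.log; the path
# and message format vary by distribution and logging configuration.
import re

LOGIN_RE = re.compile(
    r"(?P<timestamp>\w{3}\s+\d+\s[\d:]+).*sshd\[\d+\]: "
    r"Accepted (?P<method>\S+) for (?P<user>\S+) from (?P<ip>\S+)"
)

def ssh_logins(path="/var/log/auth.log"):
    """Yield (timestamp, user, source_ip, auth_method) for each SSH login."""
    with open(path, errors="replace") as f:
        for line in f:
            m = LOGIN_RE.search(line)
            if m:
                yield (m["timestamp"], m["user"], m["ip"], m["method"])

if __name__ == "__main__":
    for event in ssh_logins():
        print(event)
```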
As a test, before logging into a server to carry out a task, ask yourself:
- Was this task first tested in a dev/QA/test environment?
- Is this a one-off task (versus a routine task or request)?
If you answer “no” to either question, you want to reconsider your workflow and think of ways to automate away the kind of work you SSH in for.
Let’s review some common reasons a cloud engineer would want to log into a server:
To look at logs, like application, container, or operating system logs. This is a solved problem: a stack like Elasticsearch, Fluentd, and Kibana, or a third-party log service in the cloud, provides log aggregation, search, visualization, and permanent storage with proper life cycle management and backups.

For monitoring: to look at server telemetry like CPU/RAM/disk usage or exposed application performance metrics. This is also a solved problem, with a myriad of commercial and open source tools at our disposal.
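To make this concrete: with aggregation in place, a log search becomes an API call rather than grepping files over SSH. Here is a minimal sketch using the Elasticsearch Python client (8.x-style API); the host, index pattern, and field names are assumptions that depend on how Fluentd ships the logs.

```python
# Sketch: searching aggregated logs instead of SSHing in to grep files.
# Assumes logs shipped by Fluentd into "fluentd-*" indices with "message"
# and "@timestamp" fields; adjust names to your pipeline.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://elasticsearch.internal:9200")  # hypothetical host

resp = es.search(
    index="fluentd-*",
    query={
        "bool": {
            "must": [{"match": {"message": "connection refused"}}],
            "filter": [{"range": {"@timestamp": {"gte": "now-1h"}}}],
        }
    },
    size=20,
    sort=[{"@timestamp": {"order": "desc"}}],
)

for hit in resp["hits"]["hits"]:
    print(hit["_source"].get("@timestamp"), hit["_source"].get("message"))
```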
For routine changes in the system, such as making configuration changes, patching the operating system, managing software installations and upgrades, or performing backups and restores. All of these should ideally be done using Infrastructure as Code: we declare our infrastructure in code, which we keep versioned, and we make changes by modifying that code. Then, depending on our workflow, philosophy, and tooling, we can apply the changes with configuration management tools, re-create the server image, or use our favorite programming language and take advantage of the cloud vendor’s SDK or API.
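As one example of the SDK route: on AWS, OS patching can be submitted as a managed operation through Systems Manager rather than typed into an interactive session. This sketch assumes the target instances run the SSM agent with an appropriate instance profile; the instance ID and region are placeholders.

```python
# Sketch: patching instances via AWS Systems Manager instead of SSH.
# Assumes the instances run the SSM agent and have an instance profile
# that allows Systems Manager; instance ID and region are placeholders.
import boto3

ssm = boto3.client("ssm", region_name="us-east-1")

response = ssm.send_command(
    InstanceIds=["i-0123456789abcdef0"],  # placeholder instance ID
    DocumentName="AWS-RunPatchBaseline",  # AWS-managed patching document
    Parameters={"Operation": ["Install"]},
    Comment="Routine OS patching via SSM, no interactive login",
)

command_id = response["Command"]["CommandId"]
print(f"Patch command submitted: {command_id}")
```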
Running tests. “Testing” in production can be needed to get a real view of application behavior; fake test data rarely behaves like the real thing. Or we may need to run a query whose results are not available from a reporting server. While these are valid tasks, we should still avoid ad-hoc manual operations and look into replacing them with code and systems that perform such operations with less risk, as sketched below.
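For the reporting-query case, one lower-risk pattern is a small, reviewable script run with a read-only credential against a replica, instead of an interactive session on the primary. A sketch assuming PostgreSQL and psycopg2; the host, database, role, and query are hypothetical.

```python
# Sketch: an ad-hoc production query as a reviewable, read-only script
# against a replica, rather than typed live into a server session.
# Host, database, role, and query are hypothetical placeholders.
import os

import psycopg2  # assumes a PostgreSQL-based reporting use case

QUERY = "SELECT count(*) FROM orders WHERE created_at >= now() - interval '1 day'"

def run_readonly_query() -> int:
    conn = psycopg2.connect(
        host="replica.db.internal",         # read replica, not the primary
        dbname="appdb",
        user="readonly_reporter",           # role with SELECT-only grants
        password=os.environ["PGPASSWORD"],  # from a secrets manager in practice
    )
    conn.set_session(readonly=True)         # refuse writes at the session level
    try:
        with conn.cursor() as cur:
            cur.execute(QUERY)
            return cur.fetchone()[0]
    finally:
        conn.close()

if __name__ == "__main__":
    print(run_readonly_query())
```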
“My server is a snowflake that needs constant TLC.” Then look into “cattle versus pets,” because you’ve got some problems.
“I don’t know what is running on this server or what this server is supposed to run.” Then you’ve got bigger problems.
There are a few justified reasons to SSH into a production server that is part of an application running in the cloud.
Sometimes while troubleshooting we need to log into a server as a last resort, because the information we get from the log and metrics servers is not enough to determine the cause of a problem. For example, we may not be receiving the logs or metrics themselves, or there may be network issues of the type “this host doesn’t seem to be able to talk to this other host” and we want to verify connectivity. There may also be hard Linux kernel issues, or strange, unexplained behavior for which there are no logs or other indirect information.
Another reason to SSH into servers is exploration or learning for people who are new to a team.
In any case, next time you are about to log into a server, stop and think: “How could I accomplish this task without manually getting into the server?”