Can't Connect to a Service: Linux Troubleshooting Guide

Summary: A guide to troubleshooting the case where “A” cannot communicate with “B” in Linux.

In interviews, many troubleshooting scenarios turn out to be this one in the end (often disguised). For example, you are given the prompt “the website is down”, and after investigating you find that the web application’s connection to the database fails.

This is an example of systematic troubleshooting, using “divide and conquer”, that touches a variety of troubleshooting techniques in a Linux environment.

Three Parts: Client - Network - Service

A network connection to a service can be loosely divided into three parts: Client <-> Networking Communication <-> Service

Where:

The Client can be an end user’s machine, such as someone’s laptop at home, or it can be a server in the cloud or in a data center.
Networking can include the Internet or be entirely inside a cloud provider.
The Service is, well, a service exposed on a server, like a web server or a database server.

On the client side, not many interesting things can go wrong other than a local firewall or perhaps some uncommon configuration (for example, an http_proxy environment variable being set while using curl). If a particular user, or we ourselves, are having a problem, a first step in incident management is to verify the issue. This can be done with tools like Down for Everyone or Just Me if the target service is exposed to the Internet, or by trying from other clients.
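
As a quick way to spot the proxy case mentioned above, here is a minimal Python sketch that lists proxy-related environment variables set in the current session. The function name proxy_env_vars and the list of variable names are assumptions chosen for illustration; the list is not exhaustive.

```python
import os

# Commonly used proxy variable names (an illustrative, non-exhaustive list).
PROXY_VARIABLES = ("http_proxy", "https_proxy", "all_proxy", "no_proxy")


def proxy_env_vars():
    """Return proxy-related environment variables that are currently set."""
    found = {}
    for name in PROXY_VARIABLES:
        # Tools differ on which case they honor, so report both variants.
        for variant in (name, name.upper()):
            if variant in os.environ:
                found[variant] = os.environ[variant]
    return found
```

An empty result suggests that no proxy is being picked up from the environment of this session; a non-empty one is worth comparing against what the client is expected to use.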

For the networking part, end users (or we ourselves while troubleshooting) can hit issues with a local WiFi router or a particular ISP that affect one or a few users; those can be treated the same as the “client issues” in the previous paragraph. As for networking in general, for example a VM in the cloud trying to connect to another VM, the topic is so vast that it won’t fit in a book, let alone an article. VPCs, subnets, routing, gateways, peering, VPNs… there’s an explosive combination of things that could be set up wrong. We’ll assume that IP connectivity is there (we can check from other servers in the same subnet as the target) and focus on Linux server problems.

Linux Server

We’ll focus on the server where the service is supposed to be running.

Quick Linux Server Review

It’s never a bad idea, especially if we are not familiar with the server, to run a quick “one minute” server review.

[Figure: one-minute Linux server review]

Here we are dividing the commands into three sections:

In blue, the purpose of the server, what it does (what ports are exposed).
In red, how saturated or busy its hardware resources are.
In green, the possible server and application errors.

Those commands are examples (some are getting obsolete or are distro-dependent); you should make your own list and practice them so they become muscle memory.

Is the Service Running?

Check if the service is running with netstat -tlpn, ps (like ps -auxf), systemctl status <service>, or curl.
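
If netstat is not installed, the same “what is listening” question can be answered from /proc/net/tcp, which covers IPv4 TCP sockets only. Below is a minimal Python sketch, assuming the standard Linux layout of that file (hexadecimal local address and port in the second column, socket state in the fourth, with 0A meaning LISTEN). The function name listening_tcp_ports is hypothetical.

```python
def listening_tcp_ports(proc_net_tcp_text):
    """Return the sorted TCP ports in LISTEN state found in /proc/net/tcp text."""
    ports = set()
    for line in proc_net_tcp_text.splitlines():
        fields = line.split()
        # Data rows start with an index like "0:"; skip the header and blanks.
        if len(fields) < 4 or not fields[0].endswith(":"):
            continue
        local_address, state = fields[1], fields[3]
        if state == "0A":  # 0A is the LISTEN state
            port_hex = local_address.rsplit(":", 1)[1]
            ports.add(int(port_hex, 16))
    return sorted(ports)
```

On a Linux server this could be called with the contents of /proc/net/tcp (and /proc/net/tcp6 for IPv6 sockets). Unlike netstat -tlpn, it does not show which process owns each port.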

Service Not Running

If the service is not running:

Check logs, both application logs and OS logs.
Try to start the application, and check the log files if it won’t start.
If the application won’t start and there is no indication of the problem in the logs, you can use strace to investigate what happens when starting the service.

Typical sources of problems when an application won’t start are:

Configuration issues. Check the application configuration file and make sure it’s the one you think it is (review the systemd service file; sometimes, but not always, lsof reveals which config file is in use). Some applications provide a tool to verify the syntax of their configuration file. Another technique is to run the application with its default config file (or the simplest one possible) to confirm whether it’s a configuration problem.
Hardware resources. Common examples are: not enough RAM, out of disk space and software limits (ulimit).
Software “bug”. A stack trace in the application log is an indication of this, for instance.
Dependencies. A dependency may be missing, like a local package or library or a third-party API service can’t be reached or is not available.
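
For the hardware resources item above, here is a minimal Python sketch that reports disk usage and the open-file limit. The function name resource_snapshot and the returned keys are assumptions for illustration. Note that the limit shown is the one for the process running the script; a running service’s own limits are listed in /proc/<pid>/limits.

```python
import resource
import shutil


def resource_snapshot(path="/"):
    """Return disk usage for `path` and this process's open-file limits."""
    usage = shutil.disk_usage(path)
    soft_nofile, hard_nofile = resource.getrlimit(resource.RLIMIT_NOFILE)
    used_pct = round(100 * usage.used / usage.total, 1) if usage.total else 0.0
    return {
        "disk_used_pct": used_pct,         # percent of the filesystem in use
        "disk_free_bytes": usage.free,     # bytes available to unprivileged users
        "nofile_soft": soft_nofile,        # soft limit, as shown by ulimit -n
        "nofile_hard": hard_nofile,        # hard limit, as shown by ulimit -Hn
    }
```

A disk that is close to 100% used, or a soft open-file limit that is low for the workload, would be a lead worth following before digging into the application itself.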

If the service runs only for a while:

If the application runs for a period of time and then crashes at intervals, this typically indicates a resource leak, most commonly a memory leak (a file descriptor leak is another example).

A graph pattern that is a sign of a memory leak is the “sawtooth” diagram, where RAM consumption goes up steadily until the application or the server runs out of memory (OOMs) and the line in the graph abruptly goes down, only to start the cycle again.
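
To make the pattern concrete, here is a minimal Python sketch that flags a sawtooth shape in a series of memory samples. The function names, the 50% drop threshold and the two-cycle minimum are arbitrary assumptions for illustration, not a standard detection method.

```python
def sawtooth_drops(samples, drop_fraction=0.5):
    """Return indexes where usage falls by more than drop_fraction of the previous sample."""
    drops = []
    for i in range(1, len(samples)):
        previous, current = samples[i - 1], samples[i]
        if previous > 0 and (previous - current) / previous > drop_fraction:
            drops.append(i)
    return drops


def looks_like_leak(samples, drop_fraction=0.5, min_cycles=2):
    """Heuristic: repeated abrupt drops, with growth inside every segment between them."""
    drops = sawtooth_drops(samples, drop_fraction)
    if len(drops) < min_cycles:
        return False
    starts = [0] + drops
    ends = drops + [len(samples)]
    segments = [samples[start:end] for start, end in zip(starts, ends)]
    # Each multi-sample segment should end higher than it started.
    return all(segment[-1] > segment[0] for segment in segments if len(segment) > 1)
```

With samples such as 100, 200, 300, 400, 50, 150, 250, 350, 60, 160 (in MB), the two abrupt drops and the growth between them match the pattern, while a flat series does not.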

Service Is Running

If the application is running, check with the client tool that matches the service; for example, mysql for a MySQL database or psql for a PostgreSQL database. If you can check the code or configuration, try to use the same options as the application does (for example, the database connection string, including hostname, port, database, user and password).

In most cases you can use curl or netcat (nc) regardless of service type; for example, you can point them at a database to test connectivity. The connection won’t work at the application level, but we can see whether the service replies (and cuts the connection).

Also, if curl “hangs” (never gets a reply), this is almost always a sign of a firewall or some other type of network filtering.
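
The difference between a service that answers, a port that refuses the connection and a connection that hangs can be sketched in Python with a plain TCP connect. The function name probe_tcp, its return labels and the 3-second timeout are assumptions for illustration.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Try a TCP connection and classify the outcome as a short label."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"      # something accepted the connection
    except ConnectionRefusedError:
        return "refused"       # host reachable, but nothing listening on the port
    except socket.timeout:
        return "timeout"       # no answer at all: often a firewall dropping packets
    except OSError as exc:
        return f"error: {exc}"  # DNS failure, unreachable network, etc.
```

A “refused” result suggests the host is reachable but nothing is listening on that port (or something is actively rejecting the connection), while a “timeout” is the programmatic equivalent of curl hanging.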

As general advice, don’t use ping to test connectivity to a service: it uses a different, lower-level protocol (ICMP) that is sometimes blocked, so a failed ping doesn’t mean much. Also, for example, Kubernetes Services don’t reply to pings, since they are virtual IPs.

Also try to use IP addresses instead of hostnames, or use both to rule out or confirm possible DNS issues. Also remember that 127.0.0.1 is not the same as localhost for firewalling and other purposes.
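
To see what a hostname actually resolves to before testing with it, here is a minimal Python sketch using the system resolver. The function name resolve_all is hypothetical, and the port argument only matters for the lookup call, not for the result.

```python
import socket


def resolve_all(hostname, port=80):
    """Return the distinct IP addresses the system resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        address = sockaddr[0]  # first element is the IP for both IPv4 and IPv6
        if address not in addresses:
            addresses.append(address)
    return addresses
```

On many systems, resolving localhost may return the IPv6 address ::1 as well as 127.0.0.1, which is one reason the two are not interchangeable when a service or firewall rule is bound to only one of them.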

No connection. If the service is running but there’s no connectivity, check the logs and the local firewall (iptables/nftables).

Tries to connect. If the service “tries” to connect but doesn’t go on to reply properly, this can be an authentication issue (see the server message and logs) or one of the problems listed in the “Service Not Running” section. It can also happen that the service is slow to respond or times out due to CPU starvation (the “one minute server review” should have revealed this possibility) or a dependency service timing out.

Connects properly. If it connects and responds fine locally on the server but not from the outside, this is a networking problem. A common issue is a cloud or infrastructure-level firewall or network filtering outside the server (e.g. in AWS, a security group or NACL); as mentioned, an indication of this is curl “hanging” when trying to connect from the outside. You can also use tcpdump on the target server to check whether requests from the outside are making it to the server.
