Network Troubleshooting

Network issues are common; the symptoms vary; the actual cause is often somewhere unexpected. The systematic approach: narrow down where in the network the problem is, before guessing at fixes.

This page covers the diagnostic toolkit and the workflow.

The systematic approach

When a network problem appears:

1. **Reproduce**: confirm the issue. Does it happen reliably or intermittently?

2. **Narrow scope**: client problem? Network path? DNS? Server?

3. **Tools per layer**: each layer has tools to test it

4. **Fix at the right layer**: don't fix server when DNS is the problem

Layer-by-layer tools

DNS

```

dig example.com

nslookup example.com

host example.com

```

`dig` is the most flexible:

```

dig example.com # A record

dig example.com AAAA # IPv6

dig example.com MX # mail

dig +trace example.com # show full resolution path

dig @8.8.8.8 example.com # specific resolver

```

If DNS doesn't resolve, that's the problem. If it resolves but to wrong IP, propagation or config issue.

Reachability

```

ping <host>

ping6 <host> # IPv6

```

Basic round-trip test. If ping fails, host is unreachable or filtering ICMP. Many cloud networks block ICMP.

`ping` doesn't test application — just basic IP connectivity.

Path

```

traceroute <host>

mtr <host> # combines ping + traceroute, continuous

```

Shows each hop. Useful for identifying where packets are lost or delayed.

`mtr` is more useful than `traceroute` for ongoing debugging — refreshes continuously.

Port reachability

```

nc -zv <host> <port> # check if port is open

telnet <host> <port> # interactive

nmap -p <port> <host> # more thorough

```

If port unreachable: firewall, security group, or service not listening.

TCP details

```

ss -tnp # active connections

ss -tlnp # listening sockets

ss -s # summary

netstat -an | grep ESTAB # legacy version

```

Shows what's connected to what. Important for diagnosing connection-pool issues, port exhaustion, etc.

HTTP

```

curl -v https://example.com

curl -I https://example.com # headers only

curl --resolve host:80:1.2.3.4 ... # override DNS

```

Verbose curl shows the full TLS handshake, headers, response. For HTTP-level issues, this is the workhorse.

```

curl --trace-ascii out.txt ... # full trace

```

For deep debugging.

Packet capture

```

tcpdump -i any -nn host <host>

tcpdump -i any -nn -w capture.pcap host <host>

```

Capture and analyze. The .pcap file opens in Wireshark for visualization.

For debugging issues that need to see actual packets — TCP retransmits, missing handshakes, malformed headers.

TLS

```

openssl s_client -connect host:443

openssl s_client -connect host:443 -servername sni.example.com

```

Tests TLS handshake. Useful for cert issues, SNI problems, protocol mismatches.

Common diagnostic flows

"Site is slow"

1. `time curl -o /dev/null https://example.com` — total time

2. `curl -w '@curl-format.txt' ...` — break down by phase

- `time_namelookup` (DNS)

- `time_connect` (TCP)

- `time_appconnect` (TLS)

- `time_starttransfer` (TTFB)

3. Identify which phase is slow; investigate that

"Can't connect"

1. `dig <host>` — DNS works?

2. `ping <host>` — IP reachable? (may be blocked)

3. `nc -zv <host> <port>` — port open?

4. `curl -v https://<host>` — application responds?

If DNS fails, fix DNS. If port closed, fix firewall/security group. If application fails, server-side issue.

"Intermittent failures"

`mtr` for several minutes. Look for:

- Packet loss at specific hop

- Latency spikes

- Routing changes

Often a network mid-path issue that's not your immediate infrastructure.

"TLS error"

```

openssl s_client -connect host:443 -showcerts

```

Examine the cert chain. Common issues:

- Expired cert

- Wrong CN/SAN

- Missing intermediate certs

- Untrusted CA

Cloud-specific tools

AWS

- VPC Flow Logs

- VPC Reachability Analyzer

- Route 53 Resolver query logging

Kubernetes

- `kubectl exec` into pod, run standard tools

- `kubectl logs`

- Service mesh tooling (Istio dashboards, etc.)

Containers

- Network namespaces; `ip netns`

- Inside-container tools may be limited; install on need

Common pitfalls

Different DNS in different places

Local DNS resolver, application DNS cache, VPC resolver — may give different answers.

Caching obscuring problems

Browser cache, CDN cache, DNS cache. When debugging, work to bypass caches.

Logging at the wrong layer

Web server logs don't show network errors. Application logs may not show TLS issues. Look at the right place for the right symptom.

Time differences

Client clock vs. server clock matters for TLS (cert validity windows, JWT expiration).

MTU issues

Fragmented packets through tunnels. Ping with `-M do -s 1472` to test path MTU.

Common failure patterns

- **Guessing instead of measuring.** Assume DNS works without checking.

- **Fixing symptoms, not causes.** Restart the app when the network is the problem.

- **No baseline.** Don't know what "normal" looks like.

- **Logs not preserved.** Issue resolves; logs gone; can't analyze later.

- **Trusting the network is healthy.** It isn't, sometimes.

Further Reading

- [TcpIpFundamentals](TcpIpFundamentals) — What you're diagnosing

- [DnsDeepDive](DnsDeepDive) — DNS-specific

- [LoadBalancingStrategies](LoadBalancingStrategies) — LB-related issues

- [DebuggingStrategies](DebuggingStrategies) — Broader debugging discipline

- [Networking Hub](NetworkingHub) — Cluster index