Debugging Strategies

Most debugging is done by guess-and-check, which is why most debugging is slow. Effective debugging is a structured process: reproduce the bug, narrow down its location, form a testable hypothesis about the cause, and verify both the hypothesis and the fix. Each step has techniques that work better than guessing.

This page is about the systematic approach, the specific techniques that compound into faster diagnosis, and the patterns that catch even experienced engineers.

The four-step framework

1. Reproduce

A bug you cannot reproduce is a bug you cannot fix reliably. Time spent making the bug reproducible is rarely wasted; you can iterate on potential fixes only after you have a reproducible test case.

A good reproduction has two properties:

Minimal: the smallest input or sequence that triggers the bug. The minimal case is what you want for debugging; full-size reproductions are slow to run and obscure the cause.
Reliable: the bug happens every time you run the case. If a bug is intermittent, making it reliable is the first priority, ahead of any other debugging.
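
Shrinking a failing input toward a minimal case can often be automated. The sketch below is a simplified greedy reducer, not a full delta-debugging implementation. The names `minimize` and `fails` are hypothetical: `fails` stands for whatever predicate runs your code and reports whether the bug still triggers.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def minimize(items: Sequence[T], fails: Callable[[List[T]], bool]) -> List[T]:
    """Greedily drop chunks of the input while the failure still reproduces."""
    current = list(items)
    if not fails(current):
        raise ValueError("the starting input does not reproduce the bug")
    chunk = max(len(current) // 2, 1)
    while chunk >= 1:
        i = 0
        while i < len(current):
            candidate = current[:i] + current[i + chunk:]
            if candidate and fails(candidate):
                current = candidate  # the chunk was irrelevant; keep it removed
            else:
                i += chunk           # the chunk is needed; move past it
        chunk //= 2
    return current


# Hypothetical bug: the code under test breaks whenever 7 and 13 both appear.
def fails(data: List[int]) -> bool:
    return 7 in data and 13 in data


print(minimize(list(range(20)), fails))  # [7, 13]
```

The reducer only helps once the reproduction is reliable: if `fails` is flaky, a chunk can be discarded by accident.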

If a bug seems unreproducible, look for environmental dependencies: timing, ordering, concurrency, leftover state, time of day, configuration. The cause is often there.

2. Narrow

The bug is now reproducible. Next, find where it happens. Two techniques dominate:

Bisection

Halve the suspect space repeatedly. If a recent set of changes introduced the bug, git bisect performs a binary search over the commits between a known-good and a known-bad revision, so finding the offending commit among n candidates takes about log2(n) test runs.

Bisection also works on code regions: comment out half the function, see if the bug persists, narrow accordingly.
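
The same halving logic applies to any ordered list of suspects. As a minimal sketch, assuming the first item is known good, the last is known bad, and the bug stays present once introduced (`first_bad` and `is_bad` are hypothetical names):

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def first_bad(changes: Sequence[T], is_bad: Callable[[T], bool]) -> T:
    """Return the earliest change for which is_bad is True.

    Assumes changes[0] is good, changes[-1] is bad, and that every change
    after the first bad one is also bad.
    """
    good, bad = 0, len(changes) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(changes[mid]):
            bad = mid    # the bug was introduced at mid or earlier
        else:
            good = mid   # the bug was introduced after mid
    return changes[bad]


# Hypothetical history of 1000 revisions where revision 613 introduced the bug.
tested = []


def is_bad(revision: int) -> bool:
    tested.append(revision)  # record each probe to count the work done
    return revision >= 613


culprit = first_bad(range(1000), is_bad)
print(culprit, "found after", len(tested), "tests")
```

With 1000 candidates the search needs at most ten probes, which is why bisection pays off even when each test run is slow.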

Inspection at boundaries

Add logging or assertions at function boundaries. Find the first boundary that sees bad data. The bug lies between that boundary and the last one where the data was still good.
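
One way to instrument a boundary without editing the function body is a decorator. This is a sketch under assumptions: `checked`, `apply_discount`, and the invariants are hypothetical, and the assertions disappear when Python runs with `-O`.

```python
import functools
import logging

logger = logging.getLogger("boundary")


def checked(pre=None, post=None):
    """Log arguments and results at a function boundary and assert invariants."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            logger.debug("enter %s args=%r kwargs=%r", func.__name__, args, kwargs)
            if pre is not None:
                assert pre(*args, **kwargs), f"bad input to {func.__name__}: {args!r}"
            result = func(*args, **kwargs)
            logger.debug("exit %s result=%r", func.__name__, result)
            if post is not None:
                assert post(result), f"bad output from {func.__name__}: {result!r}"
            return result
        return wrapper
    return decorate


# Hypothetical pipeline stage: prices must stay non-negative across the boundary.
@checked(pre=lambda prices, rate: all(p >= 0 for p in prices),
         post=lambda out: all(p >= 0 for p in out))
def apply_discount(prices, rate):
    return [p - p * rate for p in prices]


print(apply_discount([10.0, 20.0], 0.5))  # [5.0, 10.0]
```

If the input check passes and the output check fails, the bad data was produced inside this function rather than upstream of it.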

3. Hypothesize

You found where the bug happens. Now form a hypothesis about why. The hypothesis must be testable.

Bad hypothesis: "It's a memory issue." Good hypothesis: "The for-loop on line 47 processes entries in reverse order, so by the time entry 5 is processed, the higher-numbered entries it still refers to have already been freed."

The hypothesis names a specific mechanism. The mechanism makes a specific prediction.

4. Verify

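Test the hypothesis. The fix is not the verification. Verification takes two tests: one that confirms the mechanism (the prediction from step 3 holds on the unfixed code) and one that confirms the fix resolves it (the reproduction from step 1 no longer fails).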

If the hypothesis was wrong, you learned something — go back to step 3 with new information. Wrong hypotheses are part of the process.
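
As an illustration of testing the mechanism separately from the fix, take a hypothetical bug where negative values sometimes survive a filter. The hypothesis is "the list is mutated during iteration, so the element right after each removed one is skipped". That predicts exactly which inputs fail. The function names here are invented for the example.

```python
def drop_negatives_buggy(values):
    """Original code: mutates the list while iterating over it."""
    for v in values:
        if v < 0:
            values.remove(v)
    return values


def drop_negatives_fixed(values):
    """Fix: build a new list instead of mutating during iteration."""
    return [v for v in values if v >= 0]


# Test 1 confirms the mechanism on the unfixed code. The hypothesis predicts
# that the second of two adjacent negatives survives, while negatives that
# are separated by another element are removed correctly.
assert drop_negatives_buggy([1, -1, -2, 3]) == [1, -2, 3]
assert drop_negatives_buggy([1, -1, 2, -2, 3]) == [1, 2, 3]

# Test 2 confirms that the fix resolves the original reproduction.
assert drop_negatives_fixed([1, -1, -2, 3]) == [1, 3]
```

Had the second assertion in Test 1 failed, the hypothesis would be wrong even if the fix happened to make the symptom disappear.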

High-leverage techniques

Print debugging is fine

Print debugging gets disrespected in favor of debuggers, but print debugging is often faster for the kinds of bugs that span function boundaries or involve timing. The bias against print debugging is mostly cultural.

The trick: print enough to know what is happening, but not so much that the signal drowns in noise. A few well-placed prints with structured output beat grepping through 10MB of log dumps.
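
A small helper keeps debug prints structured and easy to filter. This is one possible shape, not a standard API: `dbg` and `merge_orders` are hypothetical names, and one JSON object per line is an assumed convention.

```python
import json
import sys
import time


def dbg(event, **fields):
    """Print one JSON object per line to stderr so output is easy to filter."""
    record = {"t": round(time.monotonic(), 3), "event": event, **fields}
    line = json.dumps(record, default=repr, sort_keys=True)
    print(line, file=sys.stderr)
    return line


def merge_orders(orders):
    # Hypothetical function under investigation: later orders overwrite
    # earlier ones that share an id.
    merged = {}
    for order in orders:
        merged[order["id"]] = order
    # One print at the boundary, with counts rather than full payloads.
    dbg("merge_orders.done", n_in=len(orders), n_out=len(merged))
    return merged


merge_orders([{"id": 1}, {"id": 2}, {"id": 1}])
```

Printing counts at the boundary (three in, two out) is often enough to confirm or rule out a hypothesis without dumping the data itself.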

Debuggers are best for state inspection

A debugger shines when you need to inspect data structures at a specific moment. Stepping through code line-by-line is rarely the right use; setting a breakpoint at the suspect location and examining state usually is.
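
In Python, the built-in `breakpoint()` can be guarded by a condition so the debugger opens only when the state goes bad. The function and its invariant below are hypothetical.

```python
def reconcile(ledger):
    """Hypothetical function: the running balance should never go negative."""
    balance = 0
    for index, entry in enumerate(ledger):
        balance += entry
        if balance < 0:
            # Stop only at the moment the invariant breaks. At the debugger
            # prompt, inspect `index`, `entry`, and `ledger`, then continue.
            breakpoint()
    return balance
```

By default `breakpoint()` drops into pdb. It goes through `sys.breakpointhook`, so another debugger can be substituted, and setting the environment variable `PYTHONBREAKPOINT=0` disables it.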

Logging at boundaries

Permanent structured logs at function boundaries let you debug production issues without re-deploying. The logs cost little in normal operation; they pay back the first time you have to debug a customer-reported issue.
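
A minimal sketch of permanent structured logging using only the standard `logging` module. The `JsonFormatter` class, the `fields` key passed through `extra`, and the `charge` function are assumptions made for this example, not an established convention.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention for this sketch: callers attach structured
        # context with extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, default=repr, sort_keys=True)


stream = io.StringIO()  # stands in for stderr or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("billing")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False


def charge(customer_id, cents):
    # Log on entry and exit so a production issue can be traced per customer.
    log.info("charge.start", extra={"fields": {"customer_id": customer_id, "cents": cents}})
    ok = cents > 0
    log.info("charge.end", extra={"fields": {"customer_id": customer_id, "ok": ok}})
    return ok


charge("c-42", 1999)
print(stream.getvalue())
```

Because every line is a JSON object carrying the same `customer_id`, a customer-reported issue can be isolated by filtering on one field.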

Tracing for distributed systems

Distributed traces (OpenTelemetry, Jaeger) show how a request flows through services. For bugs that span service boundaries, traces are essentially required: print statements scattered across services are hard to correlate unless every service logs a shared request ID.
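
The core idea behind tracing is that one ID follows the request across every hop. The sketch below shows only that idea, using plain function calls in place of network requests. It is not the OpenTelemetry or Jaeger API: the `x-trace-id` header name, the `SPANS` list, and both services are hypothetical. Real systems typically propagate context in the W3C `traceparent` header.

```python
import uuid

SPANS = []  # stands in for a trace backend


def record_span(trace_id, service, operation):
    SPANS.append({"trace_id": trace_id, "service": service, "op": operation})


def inventory_service(headers, sku):
    # Downstream service: reuse the caller's trace ID instead of making a new one.
    trace_id = headers["x-trace-id"]
    record_span(trace_id, "inventory", f"reserve {sku}")
    return True


def checkout_service(headers, sku):
    # Entry point: adopt an incoming trace ID, or start a new trace.
    trace_id = headers.get("x-trace-id") or uuid.uuid4().hex
    record_span(trace_id, "checkout", "place_order")
    # Propagate the ID on every outgoing call.
    return inventory_service({"x-trace-id": trace_id}, sku)


checkout_service({}, "sku-1")
for span in SPANS:
    print(span)
```

Grouping the recorded spans by `trace_id` reconstructs the path of a single request, which is the view a tracing backend provides.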

Observability beats monitoring

Monitoring tells you the bug is happening. Observability lets you understand why. Modern observability tools (tracing, structured logs, metrics tied to traces) can convert "the API is slow sometimes" into "the API is slow when this specific code path triggers, which happens when these inputs arrive."

Specific patterns

"It works on my machine"

Different environments have different state, configuration, dependencies. The bug exists; the question is what your machine has that production does not (or vice versa). Common causes: stale data, different OS, different timezone, different file permissions, environment variables, cached state.
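
A quick way to find the difference is to capture the same facts on both machines and diff them. The sketch below uses only the standard library; the function names and the choice of fields are assumptions, and a real fingerprint would usually also include installed package versions.

```python
import json
import locale
import os
import platform
import sys
import time


def environment_fingerprint(env_vars=("PATH", "LANG", "TZ")):
    """Collect settings that commonly differ between machines."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "timezone": list(time.tzname),
        "encoding": locale.getpreferredencoding(False),
        "cwd": os.getcwd(),
        "env": {name: os.environ.get(name) for name in env_vars},
    }


def diff_fingerprints(mine, theirs):
    """Return only the keys whose values differ, as (mine, theirs) pairs."""
    keys = sorted(set(mine) | set(theirs))
    return {k: (mine.get(k), theirs.get(k)) for k in keys if mine.get(k) != theirs.get(k)}


# Save this output on each machine, then load both files and diff them.
print(json.dumps(environment_fingerprint(), indent=2))
```

Run it on the machine where the bug appears and on the one where it does not, then pass both dictionaries to `diff_fingerprints` to get a short list of suspects.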

Race conditions

The bug appears under load but not in isolation. Or it appears intermittently with no clear pattern. Race conditions are real but over-attributed, and "race condition" on its own is not a diagnosis: it names a symptom, not a mechanism. The usual mechanism is shared mutable state accessed without synchronization. Look for the shared state first, then identify which accesses to it can interleave.
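
A minimal sketch of the shared-state pattern and its fix, using a hypothetical `Counter` class. The unsynchronized method is shown for contrast only; whether it actually loses updates on a given run depends on the interpreter and on timing.

```python
import threading


class Counter:
    """Shared mutable state: every thread reads and writes `value`."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment_unsafe(self):
        # Read-modify-write with no synchronization: two threads can read the
        # same old value, and one of the two updates is then lost.
        current = self.value
        self.value = current + 1

    def increment(self):
        # The lock makes the read-modify-write one indivisible step.
        with self._lock:
            self.value += 1


def hammer(method, threads=8, iterations=10_000):
    """Call `method` from several threads at once and wait for all of them."""
    def worker():
        for _ in range(iterations):
            method()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()


safe = Counter()
hammer(safe.increment)
print(safe.value)  # 80000
```

Naming the shared state (`value`) and the unprotected access (`increment_unsafe`) turns "it's a race" into a hypothesis that can be tested.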

Heisenbugs (the bug disappears when you debug)

The act of debugging changes the bug. Common causes: added prints change the timing, debug builds disable optimizations, or the bug depended on uninitialized memory that the debug build or debugger zero-initializes.

Schroedinbugs (impossible bugs)

Bugs in code that should never have worked. Often appear after a change that fixes some other bug — the original code was working "by accident." Look for the latent bug; the change just exposed it.

Bugs that come and go

Often environmental: a flaky test that depends on filesystem state, a service that depends on an external system that has its own outages. Track them down before declaring the bug "intermittent and ignorable."
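
Before chasing a bug that comes and goes, it helps to measure how often it fails, so you can tell later whether a change made a difference. A minimal sketch follows; `failure_rate` and the flaky test are hypothetical, and the 10% outage rate is invented for the demonstration.

```python
import random


def failure_rate(test, runs=200):
    """Run `test` repeatedly and return (failures, runs)."""
    failures = 0
    for _ in range(runs):
        try:
            test()
        except AssertionError:
            failures += 1
    return failures, runs


# Hypothetical flaky test: it depends on an external system that is
# unavailable about 10% of the time.
rng = random.Random(1234)  # seeded so the demonstration itself is repeatable


def test_fetch_profile():
    external_system_up = rng.random() >= 0.10
    assert external_system_up, "upstream unavailable"


failures, runs = failure_rate(test_fetch_profile)
print(f"{failures}/{runs} runs failed")
```

A measured rate also tells you how many clean runs you need before believing a fix: a few passes prove little when the test only fails about one time in ten.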

The patterns that produce most bugs

Most production bugs fall into a small number of categories:

Off-by-one errors in loops and ranges
Null/None handling at unexpected callsites
Concurrency issues with shared mutable state
Time and timezone edge cases
Encoding issues (Unicode, file format, character sets)
State machine transitions in invalid orders
Resource cleanup (connections, file handles, memory)
Configuration drift between environments

When stuck on a bug, mentally walk through these categories. The cause is often in one of them.

What not to do

Random fix attempts. "Maybe it's this" without a hypothesis usually wastes time.
Adding catches that swallow exceptions. This converts a clear bug into a silent one. Worse than the original bug.
Rewriting the function "to be cleaner." The bug is usually a specific mechanism; rewriting may move it without fixing it.
Believing the user is wrong. They might be, but the assumption costs more time than checking.
Stopping after the first change that fixes the symptom. The first fix is often a workaround that hides a deeper issue. Verify the root cause.
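
To make the point about swallowed exceptions concrete, compare the two hypothetical parsers below. The first hides the failure behind a plausible value; the second catches only the expected error, records context, and re-raises.

```python
import logging

logger = logging.getLogger("orders")


def parse_quantity_swallowing(raw):
    # Anti-pattern: the failure disappears and a wrong value flows downstream.
    try:
        return int(raw)
    except Exception:
        return 0


def parse_quantity(raw):
    # Catch only what you expect, log the context, and keep the failure visible.
    try:
        return int(raw)
    except ValueError:
        logger.exception("could not parse quantity raw=%r", raw)
        raise


print(parse_quantity_swallowing("12x"))  # 0, with no sign that anything went wrong
```

With the first version, an order for "12x" items silently becomes an order for zero, and the bug surfaces far from its cause.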