Distributed Tracing

Distributed tracing is the capture of a request's lifecycle as it traverses service boundaries. Each segment of work is a span, and the entire tree of spans for a single request is the trace.

Core Mechanics

Tracing relies on three pillars: Propagation, Instrumentation, and Aggregation.

1. Context Propagation

Every service must forward the traceparent header (W3C Trace Context standard) on its outgoing calls. If any hop drops it, the result is "trace fragmentation", where a single request appears as several disconnected traces.

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             ver trace-id (32 hex)             span-id (16 hex)  flags
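
To make the header layout concrete, here is a minimal sketch in plain Java (no tracing library) that validates an incoming traceparent value and builds the value for an outgoing call. The class name TraceParent and its methods are hypothetical, and the strict four-field check is a simplification of the W3C parsing rules. In a real service the OpenTelemetry propagators would normally do this for you.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class TraceParent {
    // version-traceid-spanid-flags, all lowercase hex.
    private static final Pattern FORMAT =
        Pattern.compile("^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    final String version;
    final String traceId;
    final String spanId;
    final boolean sampled;

    private TraceParent(String version, String traceId, String spanId, boolean sampled) {
        this.version = version;
        this.traceId = traceId;
        this.spanId = spanId;
        this.sampled = sampled;
    }

    /** Returns null for a missing or malformed header so the caller can start a new trace. */
    static TraceParent parse(String header) {
        if (header == null) return null;
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) return null;
        String version = m.group(1);
        String traceId = m.group(2);
        String spanId = m.group(3);
        // Version ff and all-zero IDs are invalid in W3C Trace Context.
        if (version.equals("ff") || isAllZero(traceId) || isAllZero(spanId)) return null;
        int flags = Integer.parseInt(m.group(4), 16);
        return new TraceParent(version, traceId, spanId, (flags & 0x01) != 0);
    }

    /** Header for an outgoing call: same trace ID, the caller's own span ID, sampled bit preserved. */
    String forChild(String callerSpanId) {
        return version + "-" + traceId + "-" + callerSpanId + "-" + (sampled ? "01" : "00");
    }

    private static boolean isAllZero(String hex) {
        return hex.chars().allMatch(c -> c == '0');
    }
}
```

The key point is that the trace ID is copied unchanged across every hop while the span ID is replaced at each one; that shared trace ID is what lets the backend stitch spans into a single tree.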

2. Instrumentation

The industry standard is OpenTelemetry (OTel). Auto-instrumentation covers common frameworks and clients; manual instrumentation is needed for business-critical operations that cross multiple async or framework boundaries.

Concrete Example (Java/OpenTelemetry):

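import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.context.Scope;

// Manual span creation for a complex business operation.
// `tracer`, `order`, and `customer` are assumed to be in scope.
Span span = tracer.spanBuilder("process-order")
    .setAttribute("order.id", order.getId())
    .setAttribute("customer.tier", customer.getTier())
    .startSpan();

// makeCurrent() puts the span into the current context so that
// spans created inside validateOrder() become its children.
try (Scope scope = span.makeCurrent()) {
    // Perform work...
    validateOrder(order);
} catch (Exception e) {
    // Mark the span as failed and attach the exception as a span event.
    span.setStatus(StatusCode.ERROR, "Order validation failed");
    span.recordException(e);
    throw e;
} finally {
    // Always end the span, otherwise it is never exported.
    span.end();
}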

Sampling Mathematics: The Cost Lever

Tracing generates massive data volumes. At 1,000 requests per second (RPS), with 20 spans per request and 500 bytes per span, the daily uncompressed volume is:

$$1000 \text{ req/s} \times 20 \text{ spans/req} \times 500 \text{ bytes/span} \times 86400 \text{ s/day} \approx 864 \text{ GB/day}$$
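
The same arithmetic can be wrapped in a small helper to see how the sample rate moves the bill. This is a sketch only: the class and method names are hypothetical, and the inputs are the figures from the example above.

```java
// Hypothetical helper, for illustration only.
final class TraceVolume {
    /** Uncompressed bytes per day for a request rate, span fan-out, span size, and sample rate. */
    static double bytesPerDay(double requestsPerSecond, double spansPerRequest,
                              double bytesPerSpan, double sampleRate) {
        final double secondsPerDay = 86_400;
        return requestsPerSecond * spansPerRequest * bytesPerSpan * secondsPerDay * sampleRate;
    }

    public static void main(String[] args) {
        double unsampled = bytesPerDay(1000, 20, 500, 1.0);
        double onePercent = bytesPerDay(1000, 20, 500, 0.01);
        // Decimal gigabytes (1 GB = 1e9 bytes), matching the figure in the text.
        System.out.println("unsampled GB/day: " + unsampled / 1e9);
        System.out.println("1% sampled GB/day: " + onePercent / 1e9);
    }
}
```

Because volume is linear in the sample rate, a 1% sample of the same traffic comes to roughly 8.64 GB/day.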

Head-based vs. Tail-based Sampling

1. Head-based: The sampling decision is made at the start of the request (e.g., sample 1%).

Pros: Low overhead, predictable cost.
Cons: Misses outliers and rare errors.

2. Tail-based: All spans are buffered; the decision is made after the request finishes.
- Strategy: Keep 100% of errors, 100% of slow requests ($> P95$), and 1% of healthy requests.
- Math: If $E$ is the error rate (2%) and $S$ is the slow rate (5%), and the two sets do not overlap, the fraction of traces kept is $E + S + (1 - E - S) \times 1\% = 2\% + 5\% + (93\% \times 1\%) = 7.93\%$. About 88% of the kept traces ($7\% / 7.93\%$) are errors or slow requests, compared with 7% of the traces kept by uniform head-based sampling. That is more than $10\times$ better signal-to-noise than 10% head-based sampling at a comparable cost.
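
The two decision points can be sketched in plain Java as follows. This is an illustration under stated assumptions, not how any particular sampler is implemented: the class and method names are hypothetical, the trace ID is assumed to be 32 hex characters with uniformly random low bits, and errors and slow requests are assumed not to overlap.

```java
// Hypothetical helper, for illustration only.
final class SamplingSketch {
    /** Head-based: decide from the trace ID alone, so every service reaches the same verdict. */
    static boolean headSample(String traceIdHex, double rate) {
        if (rate >= 1.0) return true;
        if (rate <= 0.0) return false;
        // Take the low 64 bits of the trace ID and drop one bit to get a non-negative long.
        long low = Long.parseUnsignedLong(traceIdHex.substring(16), 16);
        long positive = low >>> 1;
        return positive < (long) (rate * Long.MAX_VALUE);
    }

    /** Tail-based: decide after the trace is complete, when its outcome is known. */
    static boolean tailKeep(boolean hasError, long durationMillis, long slowThresholdMillis,
                            String traceIdHex, double healthyRate) {
        if (hasError) return true;                              // keep 100% of errors
        if (durationMillis > slowThresholdMillis) return true;  // keep 100% of slow requests
        return headSample(traceIdHex, healthyRate);             // keep a small share of the rest
    }

    /** Expected fraction of traces kept, assuming error and slow traces are disjoint. */
    static double expectedKeptFraction(double errorRate, double slowRate, double healthyRate) {
        double healthy = 1.0 - errorRate - slowRate;
        return errorRate + slowRate + healthy * healthyRate;
    }
}
```

With the figures above, `expectedKeptFraction(0.02, 0.05, 0.01)` evaluates to 0.0793 up to floating-point rounding. The slow threshold would in practice be the service's measured P95 latency, which this sketch takes as an input rather than computing.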

What to Span

Do not span every function. Focus on:

IO Boundaries: HTTP, gRPC, DB, Cache, Queue.
Critical Path Logic: Complex calculations or ML inference.
Resource Contention: Lock acquisition/release.

Common Trace Patterns

Pattern	Detection	Fix
N+1 Queries	Trace shows many small, sequential DB spans.	Implement batching or joins.
Silent Retries	Multiple identical child spans for one logical request.	Check retry policy; ensure idempotency.
Clock Skew	Child span appears to start before parent.	Sync via NTP/PTP; use tracer-specific skew correction.
Gaps in Timeline	Large time gaps between spans.	Usually uninstrumented work (CPU/GC) or network latency; add spans around the gap to find out which.
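
Two of the detections in the table are mechanical enough to automate over exported span data. The sketch below is a rough illustration in plain Java; SpanRecord and its fields are a hypothetical, simplified view of a span (not an OpenTelemetry type), and the threshold is an arbitrary choice that should be at least 2.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper, for illustration only.
final class TracePatterns {
    /** Simplified span view; the field names are assumptions made for this sketch. */
    static final class SpanRecord {
        final String spanId;
        final String parentId; // null for the root span
        final String name;
        final long startNanos;
        final long endNanos;

        SpanRecord(String spanId, String parentId, String name, long startNanos, long endNanos) {
            this.spanId = spanId;
            this.parentId = parentId;
            this.name = name;
            this.startNanos = startNanos;
            this.endNanos = endNanos;
        }
    }

    /** N+1 heuristic: parents with at least `threshold` same-named children running back to back. */
    static List<String> findNPlusOneParents(List<SpanRecord> spans, int threshold) {
        Map<String, List<SpanRecord>> groups = new HashMap<>();
        for (SpanRecord s : spans) {
            if (s.parentId == null) continue;
            groups.computeIfAbsent(s.parentId + "|" + s.name, k -> new ArrayList<>()).add(s);
        }
        TreeSet<String> flagged = new TreeSet<>();
        for (List<SpanRecord> group : groups.values()) {
            group.sort((a, b) -> Long.compare(a.startNanos, b.startNanos));
            int run = 1; // length of the current sequential (non-overlapping) run
            for (int i = 1; i < group.size(); i++) {
                boolean sequential = group.get(i).startNanos >= group.get(i - 1).endNanos;
                run = sequential ? run + 1 : 1;
                if (run >= threshold) {
                    flagged.add(group.get(i).parentId);
                    break;
                }
            }
        }
        return new ArrayList<>(flagged);
    }

    /** Clock-skew symptom: the child's start timestamp precedes its parent's. */
    static boolean startsBeforeParent(SpanRecord child, SpanRecord parent) {
        return child.startNanos < parent.startNanos;
    }
}
```

Grouping by span name is a crude stand-in for "the same query"; a real check would more likely compare a normalized statement attribute, which this sketch does not model.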

Implementation Strategy

Standardize on W3C Trace Context.
Inject Trace IDs into logs. Use the log's trace_id field to jump from an error log to its trace.
Use a Collector. Never send traces directly from the app to the backend; use an OpenTelemetry Collector for buffering and tail-sampling.
Choose a Backend:
- Tempo/Jaeger: For self-hosted, cost-conscious deployments.
- Honeycomb: For high-cardinality exploration.