Distributed Tracing

Distributed tracing is the capture of a request's lifecycle as it traverses service boundaries. Each segment of work is a **span**, and the entire tree of spans for a single request is the **trace**.

Core Mechanics

Tracing relies on three pillars: **Propagation**, **Instrumentation**, and **Aggregation**.

1. Context Propagation

Every service must forward the `traceparent` header (W3C Trace Context standard) on its outbound calls. A service that drops the header causes "trace fragmentation", where a single request appears as disconnected spans.

```
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             ver trace-id (32 hex)               span-id (16 hex) flags
```
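
To make the header layout concrete, here is a minimal sketch of parsing and re-emitting `traceparent` in plain Java. The class name `TraceParent` and its methods are hypothetical, and the sketch only accepts version `00`; in a real service the OpenTelemetry propagators would normally do this for you.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative parser for the W3C traceparent header (version 00 only). */
final class TraceParent {
    // version(2 hex)-trace-id(32 hex)-parent/span-id(16 hex)-flags(2 hex), lowercase hex
    private static final Pattern FORMAT = Pattern.compile(
            "^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    final String traceId;
    final String spanId;
    final boolean sampled;

    private TraceParent(String traceId, String spanId, boolean sampled) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.sampled = sampled;
    }

    /** Returns the parsed header, or null if it is missing or malformed. */
    static TraceParent parse(String header) {
        if (header == null) {
            return null;
        }
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches() || !"00".equals(m.group(1))) {
            return null; // simplification: future versions are rejected here
        }
        String traceId = m.group(2);
        String spanId = m.group(3);
        // All-zero trace or span IDs are invalid.
        if (isAllZeros(traceId) || isAllZeros(spanId)) {
            return null;
        }
        int flags = Integer.parseInt(m.group(4), 16);
        return new TraceParent(traceId, spanId, (flags & 0x01) != 0); // bit 0 = sampled
    }

    /** Header for an outbound call: same trace ID, the caller's own span ID. */
    String childHeader(String newSpanId) {
        return "00-" + traceId + "-" + newSpanId + "-" + (sampled ? "01" : "00");
    }

    private static boolean isAllZeros(String hex) {
        for (int i = 0; i < hex.length(); i++) {
            if (hex.charAt(i) != '0') {
                return false;
            }
        }
        return true;
    }
}
```

The key point is in `childHeader`: the trace ID is carried through unchanged while the span ID is replaced, which is what keeps downstream spans attached to the same trace.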

2. Instrumentation

The industry standard is **OpenTelemetry (OTel)**. Auto-instrumentation covers common frameworks and clients; manual instrumentation is needed for business-critical operations that cross multiple async or framework boundaries.

**Concrete Example (Java/OpenTelemetry):**

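```java
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.context.Scope;

// Manual span creation for a complex business operation.
// Assumes `tracer`, `order`, and `customer` are already in scope.
Span span = tracer.spanBuilder("process-order")
    .setAttribute("order.id", order.getId())
    .setAttribute("customer.tier", customer.getTier())
    .startSpan();

// makeCurrent() makes this span the active one, so child spans attach to it.
try (Scope scope = span.makeCurrent()) {
    // Perform work...
    validateOrder(order);
} catch (Exception e) {
    // Mark the span as failed and attach the exception before rethrowing.
    span.setStatus(StatusCode.ERROR, "Order validation failed");
    span.recordException(e);
    throw e;
} finally {
    // Always end the span; an unended span is never exported.
    span.end();
}
```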

Sampling Mathematics: The Cost Lever

Tracing generates massive data volumes. At 1,000 requests per second (RPS), with 20 spans per request and 500 bytes per span, the daily uncompressed volume is:

$$1000 \text{ req/s} \times 20 \text{ spans/req} \times 500 \text{ bytes/span} \times 86400 \text{ s/day} \approx 864 \text{ GB/day}$$

### Head-based vs. Tail-based Sampling

1. **Head-based:** Sampling decision is made at the start of the request (e.g., sample 1%).

- **Pros:** Low overhead, predictable cost.

- **Cons:** Misses outliers and rare errors.

2. **Tail-based:** All spans are buffered; the decision is made after the request finishes.

- **Strategy:** Keep 100% of errors, 100% of slow requests ($>P95$), and 1% of healthy requests.

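- **Math:** If $E$ is the error rate (2%) and $S$ is the slow rate (5%), and the two groups do not overlap, the fraction of traces kept is $E + S + (1 - E - S) \times 1\% = 2\% + 5\% + (93\% \times 1\%) = 7.93\%$. That is less volume than 10% head-based sampling, yet it retains every error and slow trace, where 10% head-based sampling retains only about one in ten: roughly $10 \times$ more of the traces that matter.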
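
The arithmetic in this section can be reproduced with a short calculation. This is a sketch only: the class and method names are made up for illustration, and it assumes error and slow requests do not overlap.

```java
/** Back-of-the-envelope trace volume and sampling retention estimates. */
final class SamplingCost {

    /** Uncompressed span bytes per day before any sampling. */
    static double dailyBytes(double requestsPerSecond, double spansPerRequest, double bytesPerSpan) {
        return requestsPerSecond * spansPerRequest * bytesPerSpan * 86_400;
    }

    /**
     * Fraction of traces kept by tail-based sampling that keeps all errors,
     * all slow requests, and a baseline share of healthy requests.
     * Assumes the error and slow groups are disjoint.
     */
    static double tailKeptFraction(double errorRate, double slowRate, double healthySampleRate) {
        double healthyRate = 1.0 - errorRate - slowRate;
        return errorRate + slowRate + healthyRate * healthySampleRate;
    }

    public static void main(String[] args) {
        double raw = dailyBytes(1000, 20, 500);
        double tail = tailKeptFraction(0.02, 0.05, 0.01);
        System.out.printf("raw volume:         %.1f GB/day%n", raw / 1e9);
        System.out.printf("tail-based (%.2f%%): %.1f GB/day%n", tail * 100, raw * tail / 1e9);
        System.out.printf("head-based (10%%):   %.1f GB/day%n", raw * 0.10 / 1e9);
    }
}
```

Changing the three rates shows how sensitive the stored volume is to the healthy-traffic baseline: at these figures, errors and slow requests already account for most of what tail-based sampling keeps.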

What to Span

Do not span every function. Focus on:

- **IO Boundaries:** HTTP, gRPC, DB, Cache, Queue.

- **Critical Path Logic:** Complex calculations or ML inference.

- **Resource Contention:** Lock acquisition/release.

Common Trace Patterns

| Pattern | Detection | Fix |
|---|---|---|
| **N+1 Queries** | Trace shows many small, sequential DB spans. | Implement batching or joins. |
| **Silent Retries** | Multiple identical child spans for one logical request. | Check retry policy; ensure idempotency. |
| **Clock Skew** | Child span appears to start before parent. | Sync via NTP/PTP; use tracer-specific skew correction. |
| **Gaps in Timeline** | Large time gaps between spans. | Add spans around the uninstrumented work (CPU, GC pauses) or investigate network latency. |
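
Two of the detections in the table are simple enough to sketch as checks over an already-collected trace. The `SpanRecord` model and both method names below are hypothetical, not a backend schema; real backends expose richer span data and usually apply their own heuristics.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative pattern checks over the spans of a single trace. */
final class TracePatterns {

    /** Minimal span model; field names are assumptions for this sketch. */
    static final class SpanRecord {
        final String spanId;
        final String parentId; // null for the root span
        final String name;
        final long startMicros;

        SpanRecord(String spanId, String parentId, String name, long startMicros) {
            this.spanId = spanId;
            this.parentId = parentId;
            this.name = name;
            this.startMicros = startMicros;
        }
    }

    /** Clock skew candidates: IDs of spans that start before their parent. */
    static List<String> startsBeforeParent(List<SpanRecord> spans) {
        Map<String, SpanRecord> byId = new HashMap<>();
        for (SpanRecord s : spans) {
            byId.put(s.spanId, s);
        }
        List<String> skewed = new ArrayList<>();
        for (SpanRecord s : spans) {
            SpanRecord parent = byId.get(s.parentId);
            if (parent != null && s.startMicros < parent.startMicros) {
                skewed.add(s.spanId);
            }
        }
        return skewed;
    }

    /**
     * N+1 or silent-retry candidates: "parentId|name" keys where one parent
     * has at least `threshold` children with the same span name.
     */
    static List<String> repeatedChildren(List<SpanRecord> spans, int threshold) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (SpanRecord s : spans) {
            if (s.parentId != null) {
                counts.merge(s.parentId + "|" + s.name, 1, Integer::sum);
            }
        }
        List<String> repeated = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= threshold) {
                repeated.add(e.getKey());
            }
        }
        return repeated;
    }
}
```

Grouping by parent and span name cannot tell an N+1 query from a retry loop on its own; the span kind (DB versus HTTP client) and status of the repeated children usually make the difference clear.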

Implementation Strategy

1. **Standardize on W3C Trace Context.**

2. **Inject Trace IDs into logs.** Use the log's `trace_id` field to jump from an error log to its trace.

3. **Use a Collector.** Never send traces directly from the app to the backend; use an OpenTelemetry Collector for buffering and tail-sampling.

4. **Choose a Backend:**

- **Tempo/Jaeger:** For self-hosted/cost-conscious.

- **Honeycomb:** For high-cardinality exploration.
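
Step 2 above (trace IDs in logs) comes down to writing the current trace context into every log line under a stable field name. The sketch below shows that shape in plain Java; `TraceLog` and its methods are hypothetical, and in practice a logging framework's context mechanism (for example SLF4J's MDC) would typically carry these fields instead of hand-built strings.

```java
/** Illustrative key=value log formatting that carries the trace context. */
final class TraceLog {
    private static final String TRACE_FIELD = "trace_id=";

    /** Builds a log line with trace_id and span_id ahead of the message. */
    static String format(String level, String message, String traceId, String spanId) {
        return "level=" + level
                + " " + TRACE_FIELD + traceId
                + " span_id=" + spanId
                + " msg=\"" + message.replace("\"", "\\\"") + "\"";
    }

    /** Returns the trace_id of a line produced by format(), or null if absent. */
    static String traceIdOf(String logLine) {
        for (String token : logLine.split(" ")) {
            if (token.startsWith(TRACE_FIELD)) {
                return token.substring(TRACE_FIELD.length());
            }
        }
        return null;
    }
}
```

With the ID in a dedicated field, a log search for one `trace_id` value returns every line the request produced across services, and the same value opens the trace in the tracing backend.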