Load Testing Strategies
In distributed systems engineering, **Load Testing** is the empirical process of subjecting a system to anticipated peak operational volume to observe its behavior, identify latency bottlenecks, and validate capacity planning.
1. Workload Models: Open vs. Closed
A fundamental error in load testing is selecting the wrong workload model.
Closed Workload Model
In a closed system, a fixed number of concurrent virtual users (threads) loops through requests. A new request is only sent *after* the previous one completes.
* **The Trap:** If the server slows down, the test tool slows down. The load drops precisely when the system is under stress, masking the true failure point.
* **Use Case:** Validating connection pooling limits or testing legacy synchronous systems.
Open Workload Model
In an open system, requests arrive at a predefined arrival rate (e.g., 500 requests per second), regardless of how fast the server processes them.
* **The Benefit:** Accurately simulates internet traffic. If the server slows down, requests queue up, realistically exposing thread exhaustion, memory leaks, and cascading failures.
* **Tools:** Modern tools like [k6](https://k6.io/) and Gatling excel at generating open workloads.
2. The Danger of Coordinated Omission
**Coordinated Omission** occurs when a load testing tool inadvertently coordinates with the system under test to omit latency spikes from its measurements.
If a test tool expects to send a request every 10ms, but the server pauses for 100ms (e.g., during a Garbage Collection pause), the tool might silently skip sending the 9 requests that should have occurred during that pause. The resulting report will completely hide the 100ms latency spike, presenting a falsely optimistic 99th percentile (p99) latency.
To mitigate this, SREs must use testing tools engineered to correct for coordinated omission (like `wrk2` or hyperfoil) and rely on server-side metrics (Prometheus, OpenTelemetry) rather than client-side aggregates.
3. Types of Performance Tests
* **Load Testing:** Assessing behavior at expected peak volume.
* **Stress Testing:** Pushing the system beyond its limits to observe its failure modes and recovery gracefulness.
* **Soak (Endurance) Testing:** Running a moderate load for an extended duration (hours or days) to detect memory leaks and resource exhaustion.
* **Spike Testing:** Simulating a sudden, massive influx of traffic (e.g., a Black Friday sale) to test auto-scaling and circuit breakers.