Chaos Engineering

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production. It is not "breaking things in prod"; it is a **scientific experiment** to verify resilience hypotheses.

The Four Steps of a Chaos Experiment

1. **Define the Steady State:** Identify a measurable metric that indicates the system is healthy (e.g., "p99 latency < 200ms" or "HTTP 200 rate > 99.9%").

2. **Form a Hypothesis:** "If we kill one of the three database replicas, the steady state will not change."

3. **Introduce a Variable (The Fault):** Inject a failure (e.g., terminate a node, inject 500ms of network latency).

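4. **Try to Disprove the Hypothesis:** Compare the steady-state metric during the fault against the baseline. If the steady state is affected, you have found a resilience gap; if it holds, you have gained confidence in that failure mode.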
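
The four steps can be sketched as a small harness. This is a minimal illustration rather than a real tool: `run_experiment`, `ExperimentResult`, and `FakeCluster` are hypothetical names, the 150 ms penalty is invented for the demo, and the 200 ms limit comes from the steady-state example above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ExperimentResult:
    hypothesis: str
    baseline_ok: bool
    under_fault_ok: bool

    @property
    def resilience_gap(self) -> bool:
        # A gap exists only if the system was healthy before the fault
        # and unhealthy while the fault was active.
        return self.baseline_ok and not self.under_fault_ok


def p99(samples: List[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = -(-99 * len(ordered) // 100)  # integer form of ceil(0.99 * n)
    return ordered[rank - 1]


def run_experiment(
    hypothesis: str,  # Step 2: the claim under test
    measure: Callable[[], List[float]],
    inject_fault: Callable[[], None],
    remove_fault: Callable[[], None],
    p99_limit_ms: float = 200.0,
) -> ExperimentResult:
    # Step 1: confirm the steady state before touching anything.
    baseline_ok = p99(measure()) < p99_limit_ms
    if not baseline_ok:
        # Already unhealthy: injecting a fault would prove nothing.
        return ExperimentResult(hypothesis, False, False)
    # Step 3: introduce the fault, and always clean it up afterwards.
    inject_fault()
    try:
        # Step 4: re-check the steady state while the fault is active.
        under_fault_ok = p99(measure()) < p99_limit_ms
    finally:
        remove_fault()
    return ExperimentResult(hypothesis, baseline_ok, under_fault_ok)


class FakeCluster:
    """Stand-in for a real system: losing a replica adds 150 ms (invented)."""

    def __init__(self) -> None:
        self.replicas = 3

    def kill_replica(self) -> None:
        self.replicas -= 1

    def restore_replica(self) -> None:
        self.replicas += 1

    def latencies_ms(self) -> List[float]:
        penalty = 0.0 if self.replicas == 3 else 150.0
        return [80.0 + penalty + i for i in range(100)]


cluster = FakeCluster()
result = run_experiment(
    "Killing one of three replicas does not change p99 latency",
    cluster.latencies_ms,
    cluster.kill_replica,
    cluster.restore_replica,
)
print(result.resilience_gap)  # True for this fake cluster
```

In a real setup `measure` would query the metrics backend and `inject_fault` would call the chaos tooling; the `try`/`finally` is the part worth keeping, since a fault that outlives its experiment is an outage.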

Blast Radius Management

Never start with a "Chaos Monkey" that kills random production nodes. Use the **Blast Radius** progression:

- **Dev/Stage:** Break it here first. If the hypothesis fails, fix the architecture before moving on.

- **Canary:** Break it for 1% of users.

- **Production:** Break it for all production traffic, but only after the experiment has passed at the Canary stage, and keep a way to abort and remove the fault quickly.
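
The progression amounts to a gate: a wider stage runs only if the narrower one passed. A minimal sketch, where `STAGES`, `run_progression`, and the traffic fractions are illustrative assumptions rather than part of any tool:

```python
from typing import Callable, List, Optional, Tuple

# Stages in order of increasing blast radius (fraction of real users affected).
STAGES: List[Tuple[str, float]] = [
    ("staging", 0.0),     # no real users
    ("canary", 0.01),     # 1% of users
    ("production", 1.0),  # all users
]


def run_progression(run_stage: Callable[[str, float], bool]) -> Optional[str]:
    """Run stages in order and stop at the first failure.

    Returns the name of the failing stage, or None if every stage passed.
    """
    for name, fraction in STAGES:
        if not run_stage(name, fraction):
            return name  # wider stages are never attempted
    return None


# Example: a hypothetical experiment that holds in staging but fails in canary.
attempted: List[str] = []


def fake_stage(name: str, fraction: float) -> bool:
    attempted.append(name)
    return name == "staging"


failed_at = run_progression(fake_stage)
print(failed_at, attempted)
```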

Common Chaos Experiments

| Target | Fault | Hypothesis |
|---|---|---|
| **Network** | Latency Injection | "The circuit breaker will trip and fall back to cache." |
| **Storage** | Disk Full | "The application will gracefully degrade to read-only mode." |
| **Compute** | CPU Hog / OOM | "The load balancer will health-check the node out of rotation." |
| **DNS** | Resolve Failure | "The secondary DNS provider will take over automatically." |
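
To make the Network row concrete, here is a minimal sketch of the behavior that hypothesis expects: after repeated timeouts the breaker opens and requests are served from cache without touching the slow dependency. The `CircuitBreaker` class, its thresholds, and the cache layout are assumptions for illustration, not a specific library's API.

```python
import time
from typing import Callable, Dict, Optional


class CircuitBreaker:
    """Opens after consecutive timeouts and serves cached values while open."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        # After the cool-down, let a trial call through (half-open).
        return self.clock() - self.opened_at < self.reset_after_s

    def call(self, key: str, fetch: Callable[[str], str],
             cache: Dict[str, str]) -> Optional[str]:
        if self.is_open:
            return cache.get(key)  # short-circuit: skip the slow dependency
        try:
            value = fetch(key)
        except TimeoutError:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return cache.get(key)  # fall back to the last known value
        # Success: close the breaker and refresh the cache.
        self.failures = 0
        self.opened_at = None
        cache[key] = value
        return value


# Simulate injected latency: every fetch exceeds the client timeout.
calls = []


def slow_fetch(key: str) -> str:
    calls.append(key)
    raise TimeoutError("injected latency exceeded the client timeout")


now = [0.0]  # fake clock so the example does not need to sleep
breaker = CircuitBreaker(clock=lambda: now[0])
cache = {"user:1": "cached-profile"}
served = [breaker.call("user:1", slow_fetch, cache) for _ in range(5)]
print(served, len(calls))
```

The experiment passes if users keep getting the cached value and the dependency stops receiving calls once the breaker is open; here only the first three of five requests reach `slow_fetch`.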

Implementation: Chaos Mesh (Kubernetes)

For teams on K8s, **Chaos Mesh** is a widely used open-source option. It allows you to inject faults via CRDs (Custom Resource Definitions) without changing application code.

```yaml
# Example: Network latency injection
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay
spec:
  action: delay
  mode: one            # affect a single matching pod
  selector:
    namespaces:
      - default
    labelSelectors:
      'app': 'my-web-app'
  delay:
    latency: '200ms'
    jitter: '50ms'
  duration: '5m'       # the fault is removed automatically after 5 minutes
```
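
When the same experiment has to be repeated with different targets or delays, it can help to generate the manifest instead of hand-editing it. The sketch below builds the same NetworkChaos object as a Python dict and prints it as JSON, which `kubectl apply -f` accepts as well as YAML. The function name `network_delay_manifest` and its parameters are hypothetical; the field names mirror the manifest above.

```python
import json
from typing import Any, Dict


def network_delay_manifest(
    name: str,
    namespace: str,
    app_label: str,
    latency: str = "200ms",
    jitter: str = "50ms",
    duration: str = "5m",
) -> Dict[str, Any]:
    """Build a NetworkChaos delay manifest with the same fields as the YAML example."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": name},
        "spec": {
            "action": "delay",
            "mode": "one",
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
            "delay": {"latency": latency, "jitter": jitter},
            "duration": duration,
        },
    }


manifest = network_delay_manifest("network-delay", "default", "my-web-app")
print(json.dumps(manifest, indent=2))
```

Saving the output to a file and applying it with `kubectl apply -f` should behave like applying the YAML version, assuming Chaos Mesh is installed in the cluster.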

The "Game Day" Discipline

A Game Day is a scheduled 2-4 hour window where the engineering team runs a series of chaos experiments.

- **Roles:** One person is the "Chaos Engineer" (injects faults); one is the "Incident Commander" (responds); one is the "Scribe" (records timestamps and metrics).

- **Goal:** Not just to find technical bugs, but to test the **Human Response**. Does the alert fire? Is the runbook accurate? Does the team know where the dashboard is?

Anti-Pattern: Chaos without Observability

If you inject a fault and your dashboards don't show any change—but your users are complaining on Twitter—you have a **Blind Spot**. Chaos Engineering is as much about testing your monitoring as it is about testing your code.
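
One way to make this check explicit in a Game Day report is to record, for every injected fault, whether users were affected and whether monitoring noticed. The sketch below is an assumed classification scheme; the `Outcome` names are illustrative, not standard terminology.

```python
from enum import Enum


class Outcome(Enum):
    RESILIENT = "no user impact, no alert"
    ALERT_ONLY = "monitoring noticed the fault, users did not"
    DETECTED = "user impact, and monitoring caught it"
    BLIND_SPOT = "user impact, but monitoring stayed silent"


def classify(user_impact: bool, alert_fired: bool) -> Outcome:
    """Map the two observations from one experiment to an outcome."""
    if user_impact:
        return Outcome.DETECTED if alert_fired else Outcome.BLIND_SPOT
    return Outcome.ALERT_ONLY if alert_fired else Outcome.RESILIENT


# Users complained, dashboards and alerts showed nothing.
print(classify(user_impact=True, alert_fired=False).name)  # BLIND_SPOT
```

A `BLIND_SPOT` result calls for a monitoring fix (a missing metric, alert, or dashboard panel) in addition to whatever code or architecture change the fault itself requires.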

Further Reading

- [BlamelessPostMortems](BlamelessPostMortems) — Documenting the findings from a Game Day.

- [ServiceLevelAgreements](ServiceLevelAgreements) — Defining the "Steady State" metrics.

- [CircuitBreakerPattern](CircuitBreakerPattern) — The primary defense against cascading failures.