Chaos Engineering

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production. It is not "breaking things in prod"; it is a scientific experiment to verify resilience hypotheses.

The Four Steps of a Chaos Experiment

Define the Steady State: Identify a measurable metric that indicates the system is healthy (e.g., "p99 latency < 200ms" or "HTTP 200 rate > 99.9%").
Form a Hypothesis: "If we kill one of the three database replicas, the steady state will not change."
Introduce a Variable (The Fault): Inject a failure (e.g., terminate a node, inject 500ms of network latency).
Try to Disprove the Hypothesis: If the steady state is affected, you have found a resilience gap.

Blast Radius Management

Never start with a "Chaos Monkey" that kills random production nodes. Use the Blast Radius progression:

Dev/Stage: Break it here first. If it fails, fix the architecture.
Canary: Break it for 1% of users.
Production: Break it for everyone, but only after it has passed the Canary test.

Common Chaos Experiments

Target	Fault	Hypothesis
Network	Latency Injection	"The circuit breaker will trip and fall back to cache."
Storage	Disk Full	"The application will gracefully degrade to read-only mode."
Compute	CPU Hog / OOM	"The load balancer will health-check the node out of rotation."
DNS	Resolve Failure	"The secondary DNS provider will take over automatically."

Implementation: Chaos Mesh (Kubernetes)

For teams on K8s, Chaos Mesh is the industry standard. It allows you to inject faults via CRDs (Custom Resource Definitions) without changing application code.

# Example: Network latency injection
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      'app': 'my-web-app'
  delay:
    latency: '200ms'
    jitter: '50ms'
  duration: '5m'

The "Game Day" Discipline

A Game Day is a scheduled 2-4 hour window where the engineering team runs a series of chaos experiments.

Roles: One person is the "Chaos Engineer" (injects faults); one is the "Incident Commander" (responds); one is the "Scribe" (records timestamps and metrics).
Goal: Not just to find technical bugs, but to test the Human Response. Does the alert fire? Is the runbook accurate? Does the team know where the dashboard is?

Anti-Pattern: Chaos without Observability

If you inject a fault and your dashboards don't show any change—but your users are complaining on Twitter—you have a Blind Spot. Chaos Engineering is as much about testing your monitoring as it is about testing your code.