Blue-Green Deployments
Blue-green is "run two parallel environments; switch traffic from one to the other." It's a deployment strategy with strong rollback guarantees — if the new version misbehaves, switch back, instantly.
The strategy has lost ground to canary / progressive delivery as the dominant pattern, but it's still the right answer in specific cases.
How it works
Two production environments — "blue" and "green" — both capable of serving real traffic. Only one is live at a time.
```
Today: Blue (v1.0) → live traffic
Green (v2.0) → idle, deployed
Switch: Blue (v1.0) → idle
Green (v2.0) → live traffic ← cut over via load balancer
Tomorrow: Green (v2.0) → live traffic
Blue (v3.0) → idle, deployed
```
Mechanism: a load balancer or DNS swap routes traffic. The "switch" is fast (seconds to minutes); rollback is the same swap in reverse.
What it gets right
- **Instant rollback.** If green is broken, switch back to blue. No deploy queue; no rebuild.
- **Pre-warmed environment.** Green has been running, caches are warm, JIT has compiled hot paths.
- **Testing in production-like conditions.** Smoke-test green before switching; the environment is identical to production.
- **Zero-downtime cutover.** Brief overlap window; both serve briefly; then full cutover.
What it costs
- **2× infrastructure** (briefly). Both environments run during deployment. For a stateful, expensive system, this is real money.
- **State coordination.** Both environments share the database; schema must be compatible with both versions. See expand-contract pattern.
- **Limited granularity.** Either everyone's on green or nobody is. No "10% canary, validate, expand."
- **Operational discipline.** Two environments to keep in sync; database migrations must work across both.
When blue-green wins
- **Releases require a fast, atomic, validated cutover.** Heavily regulated systems where staged rollout is hard to justify.
- **Workloads with significant warmup cost.** JVMs that take 10 minutes to JIT-compile to peak performance; ML inference services with cache warmup.
- **Stateful systems where canary is hard.** Two environments are cleaner than slicing traffic with shared state.
When canary wins
For most modern web applications, canary deployment beats blue-green:
- **Roll out to 1% of users, monitor, expand to 10%, monitor, expand to 100%.** Catches bad versions before they hit everyone.
- **Cheaper** — no full duplicate environment.
- **Reversible** — drain the canary if it's bad; users on the canary may have brief impact, but it's bounded.
- **Better metrics** — the canary's behaviour is observable separately from the main fleet.
Canary requires:
- Traffic-routing infrastructure (service mesh, load balancer with weighted targeting).
- Feature flags or version-aware code.
- Observability per version.
- Automated rollback triggers (error rate, latency).
Most modern teams use canary or some progressive delivery system (LaunchDarkly, Argo Rollouts, Flagger). Blue-green has become a niche.
Hybrid: blue-green for infrastructure, canary for code
A common shape:
- **Blue-green for infrastructure changes** (Kubernetes upgrades, database major versions, network changes). Two clusters; switch over; blue-green semantics for things you can't easily slice traffic against.
- **Canary for application releases**. Within a single environment, progressive rollout via service mesh or feature flags.
This combines the strengths. The infrastructure swap is rare and atomic; application changes are gradual and reversible.
Database considerations
Blue-green is hardest on databases. Both environments share the same database; schema changes affect both.
The discipline:
- **Migrations are forward-compatible.** Old code (still on blue) and new code (on green) both work against the new schema.
- **Use expand-contract.** Add new columns; backfill; switch traffic; later remove old. See [DatabaseMigrationStrategies]().
- **Don't deploy schema changes during the cutover window.** Migrate in advance; roll out the application that uses the new schema.
A naive blue-green where the database migrates during the switch produces broken state. Plan migrations to land before the application change.
Cutover patterns
The "switch" can be:
- **DNS swap.** Slow (DNS TTLs); some clients keep stale records. Use only with very low TTLs and acceptance that brief overlap is fine.
- **Load balancer reconfig.** Fast (seconds); precise. Most common.
- **Service mesh routing.** Same as LB but with more control.
- **Feature flag** controlling which environment to route to. Enables more nuanced rollback.
For Kubernetes specifically, blue-green is implemented via two Deployments and a Service whose selector switches. Tools (Argo Rollouts, Flagger) automate the switch and the validation.
Traffic-shifting strategies
Variations on the cutover:
- **All-at-once.** Switch 100% of traffic at once. Maximum risk.
- **Stepped.** Switch in increments (10%, 25%, 50%, 100%). Each step is a checkpoint.
- **Header-based.** Internal users hit green first; only after they validate, switch external traffic.
- **Geo-based.** Roll out by region. EU first; if good, US.
Stepped switching with automated rollback on bad metrics is the closest blue-green gets to canary's gradient.
Failure modes
- **Schema not migrated before switch.** Green starts; queries against new columns; columns don't exist. Outage.
- **Asymmetric warmup.** Blue had warm caches; green is cold; switching produces a latency spike. Pre-warm green before switching.
- **Sticky sessions.** Users mid-session on blue suddenly hit green; lose session state. Either drain blue (slow) or share session state (database, Redis).
- **Long-running connections / WebSockets.** Blue's open connections don't migrate. Drain time can be hours. Plan for it.
- **Queue / scheduled work.** A background worker on blue picked up a job; you switched to green; the work continues on blue. Coordinate with workers.
Tools
- **Argo Rollouts** — Kubernetes; handles blue-green and canary; integrates with metrics.
- **Flagger** — similar; tighter Istio / Linkerd integration.
- **AWS CodeDeploy** — supports blue-green for ECS, EKS, Lambda.
- **Cloudflare Pages / Vercel / Netlify** — built-in blue-green-like deployment for static / serverless.
For new Kubernetes deployments: pick Argo Rollouts or Flagger; both work; pick based on your service mesh.
What I'd actually recommend
For a typical modern team:
1. **Start with rolling deployments** — Kubernetes' default. Cheap, decent, good enough for most cases.
2. **Add canary for high-stakes services** — payment, auth, anything customer-facing. Argo Rollouts or Flagger.
3. **Reserve blue-green for cases where canary doesn't fit** — major infrastructure changes, environments where partial rollouts don't work.
Blue-green isn't dead; it's just not the default anymore.
Further reading
- [ContainerOrchestration]() — Kubernetes deployment primitives
- [DarkLaunchPatterns]() — release without traffic
- [CanaryDeployments]() — the gradient version
- [ChaosEngineering]() — testing the deployment doesn't break things