Blue-Green Deployments

Blue-green is "run two parallel environments; switch traffic from one to the other." It's a deployment strategy with strong rollback guarantees — if the new version misbehaves, switch back, instantly.

The strategy has lost ground to canary / progressive delivery as the dominant pattern, but it's still the right answer in specific cases.

How it works

Two production environments — "blue" and "green" — both capable of serving real traffic. Only one is live at a time.

Today:    Blue (v1.0) → live traffic
          Green (v2.0) → idle, deployed

Switch:   Blue (v1.0) → idle  
          Green (v2.0) → live traffic   ← cut over via load balancer

Tomorrow: Green (v2.0) → live traffic
          Blue (v3.0) → idle, deployed

Mechanism: a load balancer or DNS swap routes traffic. The "switch" is fast (seconds to minutes); rollback is the same swap in reverse.
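
To make the swap concrete, here is a minimal sketch in which a toy in-memory object stands in for the load balancer. The class BlueGreenRouter and its method names are hypothetical, invented for illustration; a real setup would call the load balancer's own API instead.

```python
from dataclasses import dataclass, field


@dataclass
class BlueGreenRouter:
    """Toy stand-in for a load balancer that sends all traffic to one environment."""

    environments: dict[str, str]          # environment name -> deployed version
    live: str = "blue"                    # environment currently receiving traffic
    history: list[str] = field(default_factory=list)

    def idle(self) -> str:
        """Return the name of the environment that is not serving traffic."""
        return "green" if self.live == "blue" else "blue"

    def deploy_to_idle(self, version: str) -> None:
        """Install a new version on the idle side; live traffic is untouched."""
        self.environments[self.idle()] = version

    def switch(self) -> str:
        """Cut over: the idle environment becomes live, and vice versa."""
        self.history.append(self.live)
        self.live = self.idle()
        return self.live

    def rollback(self) -> str:
        """Rollback is the same swap in reverse."""
        return self.switch()

    def route(self) -> str:
        """Version that a request arriving now would be served by."""
        return self.environments[self.live]


router = BlueGreenRouter({"blue": "v1.0", "green": "v1.0"})
router.deploy_to_idle("v2.0")   # green now holds v2.0 but is still idle
before = router.route()         # served by blue
router.switch()
after = router.route()          # served by green
router.rollback()
restored = router.route()       # served by blue again
```

The point of the sketch is that deploy and cutover are separate steps: deploy_to_idle never touches live traffic, and rollback needs no rebuild because the previous version is still deployed.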

What it gets right

Instant rollback. If green is broken, switch back to blue. No deploy queue; no rebuild.
Pre-warmed environment. Green can run and take warmup traffic before the switch, so caches are warm and the JIT has compiled hot paths by the time real users arrive.
Testing in production-like conditions. Smoke-test green before switching; the environment is identical to production.
Zero-downtime cutover. Brief overlap window; both serve briefly; then full cutover.

What it costs

2× infrastructure (briefly). Both environments run during deployment, and the old one has to stay up for as long as you want instant rollback. For a stateful, expensive system, this is real money.
State coordination. Both environments share the database; schema must be compatible with both versions. See expand-contract pattern.
Limited granularity. In its basic form, either everyone's on green or nobody is. No "10% canary, validate, expand."
Operational discipline. Two environments to keep in sync; database migrations must work across both.

When blue-green wins

Releases require a fast, atomic, validated cutover. Heavily regulated systems where staged rollout is hard to justify.
Workloads with significant warmup cost. JVMs that take 10 minutes to JIT-compile to peak performance; ML inference services with cache warmup.
Stateful systems where canary is hard. Two environments are cleaner than slicing traffic with shared state.

When canary wins

For most modern web applications, canary deployment beats blue-green:

Roll out to 1% of users, monitor, expand to 10%, monitor, expand to 100%. Catches bad versions before they hit everyone.
Cheaper — no full duplicate environment.
Reversible — drain the canary if it's bad; users on the canary may have brief impact, but it's bounded.
Better metrics — the canary's behaviour is observable separately from the main fleet.

Canary requires:

Traffic-routing infrastructure (service mesh, load balancer with weighted targeting).
Feature flags or version-aware code.
Observability per version.
Automated rollback triggers (error rate, latency).
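
Two of these requirements can be sketched in a few lines: assigning a stable slice of users to the canary, and deciding when to roll back. The function names and the thresholds (one percentage point of extra errors, 1.5× baseline p99 latency) are assumptions chosen for illustration, not recommendations.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user to a stable bucket in [0, 10000) so assignment never flaps."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def in_canary(user_id: str, percent: float) -> bool:
    """True if this user falls inside the first `percent` percent of buckets."""
    return bucket(user_id) < percent * 100


def should_rollback(
    canary_error_rate: float,
    baseline_error_rate: float,
    canary_p99_ms: float,
    baseline_p99_ms: float,
    max_error_delta: float = 0.01,     # hypothetical threshold
    max_latency_ratio: float = 1.5,    # hypothetical threshold
) -> bool:
    """Roll back if the canary is clearly worse than the main fleet."""
    too_many_errors = canary_error_rate - baseline_error_rate > max_error_delta
    too_slow = canary_p99_ms > baseline_p99_ms * max_latency_ratio
    return too_many_errors or too_slow
```

Hashing the user ID rather than picking randomly per request means a user who is in the canary at 1% stays in it at 10%, so expanding the rollout never moves anyone back to the old version.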

Most modern teams use canary or some progressive delivery system (LaunchDarkly, Argo Rollouts, Flagger). Blue-green has become a niche technique.

Hybrid: blue-green for infrastructure, canary for code

A common shape:

Blue-green for infrastructure changes (Kubernetes upgrades, database major versions, network changes). Two clusters; switch over; blue-green semantics for things you can't easily slice traffic against.
Canary for application releases. Within a single environment, progressive rollout via service mesh or feature flags.

This combines the strengths. The infrastructure swap is rare and atomic; application changes are gradual and reversible.

Database considerations

Blue-green is hardest on databases. Both environments share the same database; schema changes affect both.

The discipline:

Migrations are backward-compatible. Old code (still on blue) and new code (on green) both work against the new schema.
Use expand-contract. Add new columns; backfill; switch traffic; later remove old. See DatabaseMigrationStrategies.
Don't deploy schema changes during the cutover window. Migrate in advance; roll out the application that uses the new schema.

A naive blue-green where the database migrates during the switch produces broken state. Plan migrations to land before the application change.
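
The expand-contract steps above can be sketched against an in-memory SQLite database. The users table, the display_name column, and both insert functions are hypothetical; the only thing the sketch is meant to show is that blue's code and green's code run against the same expanded schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (name) VALUES ('Ada Lovelace')")


def old_code_insert(db: sqlite3.Connection, name: str) -> None:
    """v1 (blue): only knows about `name`."""
    db.execute("INSERT INTO users (name) VALUES (?)", (name,))


def new_code_insert(db: sqlite3.Connection, name: str, display_name: str) -> None:
    """v2 (green): writes both columns so blue can still read `name`."""
    db.execute(
        "INSERT INTO users (name, display_name) VALUES (?, ?)",
        (name, display_name),
    )


# 1. Expand, before cutover: add a nullable column. v1 is unaffected because
#    it names its columns explicitly and never touches display_name.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# 2. Backfill existing rows.
conn.execute("UPDATE users SET display_name = name WHERE display_name IS NULL")

# 3. Both versions run against the same schema during the cutover window.
old_code_insert(conn, "Grace Hopper")            # blue still works
new_code_insert(conn, "Alan Turing", "Alan T.")  # green works too

# 4. Contract (for example, dropping `name`) happens much later, once blue is
#    retired and rollback to v1 is no longer needed. It is not run here.
rows = conn.execute("SELECT name, display_name FROM users ORDER BY id").fetchall()
```

Note that the row blue wrote after the backfill has a NULL display_name. Green has to tolerate that, or the backfill has to run again after blue is retired and before the contract step.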

Cutover patterns

The "switch" can be:

DNS swap. Slow (bounded by DNS TTLs); some clients keep stale records past the TTL. Use only with very low TTLs, and only if a brief period where both environments serve traffic is acceptable.
Load balancer reconfig. Fast (seconds); precise. Most common.
Service mesh routing. Same as LB but with more control.
Feature flag controlling which environment to route to. Enables more nuanced rollback.

For Kubernetes specifically, blue-green is implemented via two Deployments and a Service whose selector switches. Tools (Argo Rollouts, Flagger) automate the switch and the validation.
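
As a rough illustration of the selector switch, the sketch below builds the patch body a Service would need and models how a selector matches pod labels. The label scheme (app plus version set to blue or green) and the service name web are assumptions; the document does not prescribe any particular labels.

```python
import json


def deployment_labels(app: str, color: str) -> dict[str, str]:
    """Labels each Deployment's pod template would carry (hypothetical scheme)."""
    return {"app": app, "version": color}


def selector_patch(app: str, color: str) -> dict:
    """Patch body that points the Service at one color's pods."""
    if color not in ("blue", "green"):
        raise ValueError(f"unknown color: {color!r}")
    return {"spec": {"selector": deployment_labels(app, color)}}


def selects(selector: dict[str, str], pod_labels: dict[str, str]) -> bool:
    """A Service selects a pod when every selector pair appears in the pod's labels."""
    return all(pod_labels.get(k) == v for k, v in selector.items())


patch = selector_patch("web", "green")
patch_json = json.dumps(patch, sort_keys=True)

# Pods carry extra labels too; only the selector's pairs have to match.
blue_pod = {**deployment_labels("web", "blue"), "pod-template-hash": "abc123"}
green_pod = {**deployment_labels("web", "green"), "pod-template-hash": "def456"}
```

Under these assumptions the JSON in patch_json could be applied with something like kubectl patch service web -p '<json>', and rollback is the same patch with blue in place of green.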

Traffic-shifting strategies

Variations on the cutover:

All-at-once. Switch 100% of traffic at once. Maximum risk.
Stepped. Switch in increments (10%, 25%, 50%, 100%). Each step is a checkpoint.
Header-based. Internal users hit green first; only after they validate, switch external traffic.
Geo-based. Roll out by region. EU first; if good, US.

Stepped switching with automated rollback on bad metrics is the closest blue-green gets to canary's gradient.
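
A minimal sketch of that stepped switch follows. The two callbacks are hypothetical hooks: set_green_weight would reconfigure the load balancer or mesh, and green_is_healthy would query your metrics. Here they are replaced by simple stand-ins so the control flow is visible.

```python
from typing import Callable, Sequence


def stepped_cutover(
    steps: Sequence[int],
    set_green_weight: Callable[[int], None],
    green_is_healthy: Callable[[int], bool],
) -> int:
    """Shift traffic to green step by step; return the final green percentage.

    After each step the health check runs. On the first failure, all traffic
    returns to blue (weight 0) and the rollout stops.
    """
    for percent in steps:
        set_green_weight(percent)
        if not green_is_healthy(percent):
            set_green_weight(0)   # automated rollback
            return 0
    return steps[-1] if steps else 0


applied: list[int] = []

# Healthy rollout: every checkpoint passes.
ok_result = stepped_cutover([10, 25, 50, 100], applied.append, lambda pct: True)
ok_trace = list(applied)

# Unhealthy rollout: green degrades once it carries half the traffic.
applied.clear()
bad_result = stepped_cutover([10, 25, 50, 100], applied.append, lambda pct: pct < 50)
bad_trace = list(applied)
```

In a real rollout each step would also wait long enough for metrics to be meaningful before the health check runs; that soak time is omitted here.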

Failure modes

Schema not migrated before switch. Green starts; queries against new columns; columns don't exist. Outage.
Asymmetric warmup. Blue had warm caches; green is cold; switching produces a latency spike. Pre-warm green before switching.
Sticky sessions. Users mid-session on blue suddenly hit green; lose session state. Either drain blue (slow) or share session state (database, Redis).
Long-running connections / WebSockets. Blue's open connections don't migrate. Drain time can be hours. Plan for it.
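Queue / scheduled work. A background worker on blue picked up a job; you switched to green; the work continues on blue, running the old code. Pause or drain blue's workers as part of the cutover, and don't tear blue down until in-flight jobs finish.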
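
Several of these failure modes come down to draining blue before retiring it. A minimal sketch of a drain loop with a deadline is below; open_connections is a hypothetical callback that would report blue's open connections (or in-flight jobs), and the clock and sleep parameters exist only so the loop can be exercised without real waiting.

```python
import time
from typing import Callable


def drain(
    open_connections: Callable[[], int],
    timeout_s: float,
    poll_interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Wait until blue has no open connections, or give up after timeout_s.

    Returns True if blue drained fully, False if the deadline passed and the
    remaining connections would have to be closed forcibly.
    """
    deadline = clock() + timeout_s
    while True:
        if open_connections() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)


# Simulated blue environment: connection counts seen on successive polls.
counts = iter([3, 2, 0])
drained = drain(lambda: next(counts), timeout_s=5, sleep=lambda s: None)

# Simulated stuck WebSocket: the count never reaches zero, so a fake clock
# that advances one second per poll eventually passes the deadline.
fake_now = [0.0]


def fake_sleep(seconds: float) -> None:
    fake_now[0] += seconds


timed_out = not drain(
    lambda: 1, timeout_s=3, clock=lambda: fake_now[0], sleep=fake_sleep
)
```

What to do when drain returns False is a policy decision: extend the deadline, or close the remaining connections and rely on clients to reconnect to green.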

Tools

Argo Rollouts — Kubernetes; handles blue-green and canary; integrates with metrics.
Flagger — similar; tighter Istio / Linkerd integration.
AWS CodeDeploy — supports blue-green for EC2, ECS, and Lambda.
Cloudflare Pages / Vercel / Netlify — built-in blue-green-like deployment for static / serverless.

For new Kubernetes deployments: pick Argo Rollouts or Flagger; both work; pick based on your service mesh.

For a typical modern team:

Start with rolling deployments — Kubernetes' default. Cheap, decent, good enough for most cases.
Add canary for high-stakes services — payment, auth, anything customer-facing. Argo Rollouts or Flagger.
Reserve blue-green for cases where canary doesn't fit — major infrastructure changes, environments where partial rollouts don't work.

Blue-green isn't dead; it's just not the default anymore.