Release Engineering

Release engineering is the discipline of getting code from main into production safely. Closing the gap between "code is on main" and "users can use it" takes real engineering: artifacts, signing, deployment strategies, rollback, and validation.

This page covers the practices that distinguish good release engineering.

Build artifacts

The output of CI is an artifact: a single deployable unit.

What an artifact is

Container image: Docker image tagged with the build SHA
JAR/WAR: Java deployment artifact
Native binary: compiled output from Go, Rust, and similar toolchains
Static files: SPA build output
Helm chart: packaged Kubernetes deployment

The artifact is what gets deployed, and the same artifact moves through dev → staging → prod.

Versioning

Artifacts are tagged with:

Git SHA (for traceability)
Semantic version (for human consumption)
Build number (for ordering)

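A common combined form: v1.2.3-build.456-abc1234. Strictly, SemVer reads a hyphenated suffix as a pre-release and puts build metadata after a plus sign (1.2.3+build.456.abc1234); the hyphenated form persists because container image tags cannot contain a plus sign.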
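
A minimal sketch of composing such a tag from its three parts, assuming the build already knows them. The helper name artifact_tag and its parameters are illustrative and not taken from any particular CI system.

```python
import re


def artifact_tag(semver: str, build_number: int, git_sha: str) -> str:
    """Compose a tag such as v1.2.3-build.456-abc1234 (hypothetical helper)."""
    if not re.fullmatch(r"\d+\.\d+\.\d+", semver):
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {semver!r}")
    if not re.fullmatch(r"[0-9a-f]{7,40}", git_sha):
        raise ValueError(f"not a hex git SHA: {git_sha!r}")
    if build_number < 0:
        raise ValueError("build number must be non-negative")
    # Use the short (7-character) SHA to keep the tag readable.
    return f"v{semver}-build.{build_number}-{git_sha[:7]}"


print(artifact_tag("1.2.3", 456, "abc1234def5678"))
```

Validating the inputs up front keeps malformed tags out of the registry, where they would be hard to correct later.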

Storage

Artifacts live in a registry such as Docker Hub, ECR, Artifactory, or GitHub Packages. Stored artifacts should be versioned, immutable, and auditable.

Signing and provenance

For supply-chain security:

Sign artifacts

Use Cosign (part of the Sigstore project) or a vendor-specific tool. The artifact carries a signature proving it came from your CI.

Provenance

Records of how the artifact was built: from what source; with what dependencies; on what infrastructure. SLSA (Supply-chain Levels for Software Artifacts) provides a framework.

For sensitive software, provenance is often required. For everything else, it is an emerging best practice.
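
As a rough illustration of what a provenance record captures, the sketch below builds a simplified record keyed by the artifact's SHA-256 digest. The field names, repository name, and builder name are assumptions made for the example; this is not the SLSA provenance schema, and a real record would itself be signed.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(artifact_bytes: bytes, source_repo: str, git_sha: str,
                      dependencies: dict, builder: str) -> dict:
    """Build a simplified provenance record (illustrative fields only)."""
    return {
        "artifact_sha256": hashlib.sha256(artifact_bytes).hexdigest(),
        "source": {"repo": source_repo, "commit": git_sha},
        "dependencies": dict(sorted(dependencies.items())),
        "builder": builder,
        "built_at": datetime.now(timezone.utc).isoformat(),
    }


def matches(artifact_bytes: bytes, record: dict) -> bool:
    """Check that an artifact is the one the record describes."""
    digest = hashlib.sha256(artifact_bytes).hexdigest()
    return digest == record["artifact_sha256"]


# All values below are placeholders for the example.
record = provenance_record(
    b"fake artifact contents",
    "example.org/team/service",
    "abc1234",
    {"requests": "2.32.0"},
    "ci-runner-01",
)
print(json.dumps(record, indent=2))
```

Binding the record to a content digest rather than a tag means a re-pushed tag cannot silently inherit another build's provenance.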

Deployment strategies

Rolling

Replace instances incrementally, a few at a time, until all run the new version. This is the default strategy for Kubernetes Deployments.

Pros: simple; minimal extra resources. Cons: brief co-existence of versions; rollback is another rolling deploy.
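
The co-existence of versions is easy to see in a small simulation. This is a sketch only: rolling_update is a hypothetical function that models the fleet as a list of version strings, and it does not talk to any real orchestrator.

```python
def rolling_update(instances, new_version, batch_size=1):
    """Yield the fleet state after each batch is replaced (simulation only)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    fleet = list(instances)
    for start in range(0, len(fleet), batch_size):
        end = min(start + batch_size, len(fleet))
        for i in range(start, end):
            fleet[i] = new_version
        # Until the last batch, old and new versions serve traffic together.
        yield list(fleet)


for state in rolling_update(["v1"] * 4, "v2", batch_size=2):
    print(state)
```

Every intermediate state contains both versions, which is why the two versions must be compatible with each other for the duration of the rollout.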

Blue-green

Two identical environments. Switch traffic from blue (current) to green (new).

Pros: instant cutover; instant rollback. Cons: 2x resources during deploy; two environments to maintain.
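
The cutover and the rollback are the same operation, which a small model makes concrete. BlueGreenRouter below is a hypothetical in-memory stand-in for whatever load balancer or DNS switch performs the real traffic change.

```python
class BlueGreenRouter:
    """Tracks two environments and which one receives live traffic (simulation)."""

    def __init__(self, live_version):
        self.versions = {"blue": live_version, "green": None}
        self.live = "blue"

    @property
    def idle(self):
        return "green" if self.live == "blue" else "blue"

    def deploy_to_idle(self, version):
        # The new version is installed where no users can reach it yet.
        self.versions[self.idle] = version

    def switch(self):
        # Cutover and rollback are the same operation: flip the pointer.
        if self.versions[self.idle] is None:
            raise RuntimeError("idle environment has nothing deployed")
        self.live = self.idle
        return self.versions[self.live]


router = BlueGreenRouter("v1")
router.deploy_to_idle("v2")
print(router.switch())  # cut over to the new version
print(router.switch())  # roll back to the previous version
```

Because the old environment stays deployed after the switch, rolling back needs no rebuild or redeploy.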

Canary

Deploy to a small subset of instances or users; verify; expand.

Pros: limits blast radius; auto-rollback possible. Cons: more complex orchestration.

Feature-flag-driven

Deploy the code with the new behavior disabled; enable it via a flag.

Pros: total control over release timing; instant rollback via flag. Cons: requires flag infrastructure; code includes both paths.
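
One common way to implement a flag is a percentage rollout with stable bucketing, sketched below. The FeatureFlag class, the flag name, and the checkout function are hypothetical; real flag services add persistence, targeting rules, and audit logs.

```python
import hashlib


class FeatureFlag:
    """Percentage rollout with stable per-user bucketing (illustrative)."""

    def __init__(self, name, rollout_percent=0):
        if not 0 <= rollout_percent <= 100:
            raise ValueError("rollout_percent must be between 0 and 100")
        self.name = name
        self.rollout_percent = rollout_percent

    def is_enabled(self, user_id):
        # Hash flag name plus user so each user lands in a fixed bucket 0..99.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100
        return bucket < self.rollout_percent


def checkout(user_id, flag):
    # Both code paths ship in the same artifact; the flag picks one at runtime.
    if flag.is_enabled(user_id):
        return "new checkout flow"
    return "old checkout flow"


flag = FeatureFlag("new-checkout", rollout_percent=0)
print(checkout("user-42", flag))  # flag is off, so the old path runs
flag.rollout_percent = 100
print(checkout("user-42", flag))  # flag is fully on, so the new path runs
```

Setting rollout_percent back to 0 is the rollback: no deploy is involved. Because buckets are fixed per user, raising the percentage only adds users and never flips anyone back and forth.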

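For most modern deploys, canary plus feature flags is widely treated as the gold standard: the canary limits who sees a bad build, and the flag limits who sees a bad feature.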

Rollback

When deploys go wrong, roll back fast.

Rollback as a button

The same automation that deploys should also roll back. Click a button; the previous version is restored.

Rollback should not require fixing

Don't require a forward fix during an incident. Roll back first; debug later.

Rollback should not lose data

Schema migrations need backwards compatibility. New code should work with old schema; old code should work with new schema (during transition).

The expand-and-contract pattern: add the new schema columns without touching the old ones (expand), deploy code that works with both, then remove the old usage once nothing depends on it (contract). Every step can be rolled back.
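
The pattern can be sketched against an in-memory SQLite database. The users table, its name and display_name columns, and the helper functions are hypothetical; the point is that old and new code both keep working after the expand step.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('Ada Lovelace')")

# Expand: add the new column as nullable so existing writers keep working.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def old_code_insert(name):
    # Old release: knows nothing about display_name.
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


def new_code_insert(name):
    # New release: writes both columns so a rollback loses nothing.
    conn.execute(
        "INSERT INTO users (name, display_name) VALUES (?, ?)", (name, name)
    )


def new_code_read(user_id):
    # New release: prefers the new column, falls back to the old one.
    row = conn.execute(
        "SELECT COALESCE(display_name, name) FROM users WHERE id = ?",
        (user_id,),
    ).fetchone()
    return row[0]


old_code_insert("Grace Hopper")
new_code_insert("Margaret Hamilton")
print(new_code_read(2))  # row written by old code is still readable

# Backfill rows written before or during the transition.
conn.execute("UPDATE users SET display_name = name WHERE display_name IS NULL")
# Contract (dropping the old column) waits until no deployed code uses it.
```

The contract step is deliberately left out of the sketch: it is the one change that cannot be undone by redeploying old code, so it belongs in a later release.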

Forward fix vs. rollback

For minor issues: forward fix. For major issues: roll back first. The decision criterion is time to safety: choose whichever path restores a healthy system soonest.

Specific patterns

Smoke tests post-deploy

After each deploy, run automated checks: critical endpoints respond; key dependencies are reachable. If any check fails, roll back automatically.
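
A sketch of that control flow, with the checks and the activation step injected as plain callables. The function name, the check names, and the activate callback are assumptions for illustration; in practice the checks would issue real HTTP requests or dependency pings.

```python
def deploy_with_smoke_tests(new_version, previous_version, checks, activate):
    """Activate new_version, run checks, and restore previous_version on failure.

    checks maps a check name to a zero-argument callable returning True or False.
    activate is a callable that makes a given version live (hypothetical hook).
    """
    activate(new_version)
    failures = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            # A check that crashes counts as a failed check.
            ok = False
        if not ok:
            failures.append(name)
    if failures:
        print("smoke tests failed:", ", ".join(failures))
        activate(previous_version)
        return previous_version
    return new_version


activations = []
checks = {
    "homepage responds": lambda: True,
    "database reachable": lambda: False,
}
live = deploy_with_smoke_tests("v2", "v1", checks, activations.append)
print(live, activations)
```

Running every check before deciding, rather than stopping at the first failure, gives the incident log a complete list of what broke.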

Canary analysis

Compare canary metrics to the baseline. If the error rate or latency is meaningfully worse on the canary, roll back automatically.
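
A minimal sketch of such a comparison follows. The Metrics fields and all three thresholds (error-rate margin, latency ratio, minimum sample size) are illustrative assumptions; production canary analysis usually applies statistical tests over many metrics.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    requests: int
    errors: int
    p95_latency_ms: float

    @property
    def error_rate(self):
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(baseline, canary, max_error_rate_increase=0.01,
                   max_latency_ratio=1.2, min_requests=100):
    """Return 'wait', 'rollback', or 'promote' (thresholds are illustrative)."""
    if canary.requests < min_requests:
        # Too little traffic to judge; keep the canary running.
        return "wait"
    if canary.error_rate > baseline.error_rate + max_error_rate_increase:
        return "rollback"
    if canary.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio:
        return "rollback"
    return "promote"


baseline = Metrics(requests=10000, errors=50, p95_latency_ms=200.0)
print(canary_verdict(baseline, Metrics(500, 3, 210.0)))
print(canary_verdict(baseline, Metrics(500, 20, 210.0)))
```

The minimum-requests guard matters: a canary that has served ten requests can look perfect or terrible by chance.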

Database migrations separately

Don't deploy schema changes with code changes. Deploy migrations first; verify; deploy code that uses them. Each step is reversible.

Environment promotion

Same artifact through dev → staging → prod. No rebuilds between environments.
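
One way to enforce "no rebuilds" is to promote by content digest and refuse anything the previous environment is not running. The PromotionPipeline class and the digest strings below are hypothetical placeholders.

```python
ORDER = ["dev", "staging", "prod"]


class PromotionPipeline:
    """Allows a digest into an environment only if the previous one runs it."""

    def __init__(self):
        self.deployed = {}  # environment name -> artifact digest

    def deploy(self, env, digest):
        index = ORDER.index(env)
        if index > 0:
            previous_env = ORDER[index - 1]
            if self.deployed.get(previous_env) != digest:
                raise ValueError(
                    f"{digest} is not what {previous_env} is running; "
                    f"refusing to deploy to {env}"
                )
        self.deployed[env] = digest


pipeline = PromotionPipeline()
pipeline.deploy("dev", "sha256:aaaa")
pipeline.deploy("staging", "sha256:aaaa")
pipeline.deploy("prod", "sha256:aaaa")
print(pipeline.deployed)
```

Comparing digests rather than tags catches the case where a tag was rebuilt or re-pushed between environments.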

Deployment windows

For high-risk changes, deploy during business hours when on-call is fully staffed. Don't deploy on a Friday afternoon.

For mature CD with automated rollback, this matters less; for systems that depend on manual incident response, it matters a lot.
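
A deploy pipeline can enforce the window with a small guard. The sketch assumes a 09:00 to 16:00 window on weekdays with a Friday cutoff at noon, all in the team's local time; those hours are illustrative, not a recommendation from this page.

```python
from datetime import datetime, time


def in_deploy_window(now, start=time(9, 0), end=time(16, 0),
                     friday_cutoff=time(12, 0)):
    """Return True if a high-risk deploy may start at `now` (local time)."""
    weekday = now.weekday()  # Monday is 0, Friday is 4, Sunday is 6
    if weekday >= 5:
        return False  # no weekend deploys
    if weekday == 4 and now.time() >= friday_cutoff:
        return False  # no Friday afternoon deploys
    return start <= now.time() < end


print(in_deploy_window(datetime(2024, 6, 5, 10, 0)))  # a Wednesday morning
print(in_deploy_window(datetime(2024, 6, 7, 15, 0)))  # a Friday afternoon
```

The guard only blocks new deploys from starting; rollbacks should stay available at any hour.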

Specific environments

Development

Continuous deployment from main. Engineers see their changes immediately.

Staging

Production-like environment for final verification. Should mirror production closely.

Production

The actual user-facing environment. Deploy from artifacts that passed staging.

Operational practices

Release notes

Every production release has notes: what changed, who approved, links to PRs/tickets. Useful when investigating issues.

Change management

For regulated environments: change requests, approval workflows, change windows. Heavy but required for compliance.

For unregulated environments: a lighter-weight process, such as auto-deploy from main with an audit trail.

Deploy frequency tracking

DORA metrics: deployment frequency, lead time for changes, change failure rate, and time to restore service (MTTR). Track them; improve them.
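
All four can be derived from a simple log of deploys. The sketch below assumes each record carries a commit time, a deploy time, a failure flag, and a restore time; the Deploy record and the choice of median lead time and mean restore time are assumptions for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None


def dora_metrics(deploys, period_days):
    """Compute the four metrics over a non-empty list of Deploy records."""
    if not deploys or period_days <= 0:
        raise ValueError("need at least one deploy and a positive period")
    failures = [d for d in deploys if d.failed]
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deploys) / period_days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deploys),
        "change_failure_rate": len(failures) / len(deploys),
        # None when no failed deploy has been restored in the period.
        "mean_time_to_restore": (
            sum(restores, timedelta()) / len(restores) if restores else None
        ),
    }


t0 = datetime(2024, 6, 3, 9, 0)
deploys = [
    Deploy(t0, t0 + timedelta(hours=2)),
    Deploy(t0, t0 + timedelta(hours=4)),
    Deploy(t0, t0 + timedelta(hours=6), failed=True,
           restored_at=t0 + timedelta(hours=6, minutes=30)),
    Deploy(t0, t0 + timedelta(hours=8)),
]
for name, value in dora_metrics(deploys, period_days=7).items():
    print(name, value)
```

Computing the metrics from the deploy log itself, rather than from a survey or a spreadsheet, keeps them honest as the process changes.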

On-call awareness

Who's on call during this deploy? They should know what's deploying and have rollback access.

Security in releases

Signed artifacts: covered above
Secrets management: secrets injected at deploy, not baked in
Vulnerability scanning: run in CI; block deploys with known CVEs above an agreed severity threshold
SBOM: Software Bill of Materials; what dependencies are in this build
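
The vulnerability gate above can be sketched as a filter over scanner findings. This assumes a team gates on a severity threshold; the finding format, the severity names, and the placeholder identifiers are assumptions and do not match any specific scanner's output.

```python
SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}


def blocking_findings(findings, threshold="high"):
    """Return the findings at or above the threshold severity."""
    limit = SEVERITY_RANK[threshold]
    blocked = []
    for finding in findings:
        # Unknown severities are treated as critical so the gate fails closed.
        rank = SEVERITY_RANK.get(
            finding["severity"].lower(), SEVERITY_RANK["critical"]
        )
        if rank >= limit:
            blocked.append(finding)
    return blocked


# Placeholder identifiers, not real CVE entries.
findings = [
    {"id": "CVE-XXXX-0001", "package": "libexample", "severity": "medium"},
    {"id": "CVE-XXXX-0002", "package": "libsample", "severity": "critical"},
]
blocked = blocking_findings(findings)
if blocked:
    print("deploy blocked by:", [f["id"] for f in blocked])
```

Failing closed on unrecognized severities is a design choice: a scanner format change then blocks deploys loudly instead of letting findings through silently.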

Common failure patterns

Manual deployment. Slow; error-prone; not reproducible.
No rollback capability. Disasters compound.
Schema migrations bundled with code changes. Can't roll back one without the other.
Deploys not tied to specific artifacts. "What's actually running?" is unanswerable.
No post-deploy validation. Bad deploys reach users.
After-hours deploys. When something goes wrong, no one's awake.