Canary Deployments: Risk Mitigation
A Canary deployment is a strategy where a new version of software is rolled out to a small subset of users before being promoted to the entire infrastructure.
1. Traffic Splitting Mechanisms
The core of a canary is the ability to route a precise percentage of traffic.
* **Layer 4 (IP-based):** Simple, but lacks granularity. Often used at the Load Balancer level (e.g., AWS Target Group weights).
* **Layer 7 (Request-based):** Superior for modern APIs. Routes based on headers, cookies, or user IDs.
* **Concrete Example (Istio):** Use a `VirtualService` to route 90% of traffic to the `stable` subset and 10% to the `canary` subset.
```yaml
http:
- route:
- destination: {host: my-svc, subset: stable}, weight: 90
- destination: {host: my-svc, subset: canary}, weight: 10
```
2. Automated Canary Analysis (ACA)
Manual verification of a canary is slow and error-prone. Use automated gates.
* **Metrics Comparison:** Compare the **P95 Latency** and **Error Rate** of the canary group against the stable group.
* **Tools:** Use **Flagger** or **Argo Rollouts**. They automatically increase the traffic weight (e.g., 5% $\to$ 10% $\to$ 50% $\to$ 100%) if metrics remain healthy.
* **Concrete Trigger:** *Abort the rollout and roll back immediately if the canary error rate exceeds the stable error rate by >2% for a 2-minute window.*
3. Data Compatibility (The Hard Part)
If Version 2 requires a database schema change, the canary (V2) and the stable pool (V1) must both function against the same DB.
* **Rule:** Database migrations must be **Additive and Backwards Compatible**. Never delete or rename columns until the rollout is 100% complete and the old version is decommissioned.
4. Sticky Canaries
To prevent a user from flipping between V1 and V2 (which can break UIs), use **Session Affinity**.
* **Implementation:** Set a `canary-version` cookie on the first request. The API Gateway or Load Balancer respects this cookie to ensure the user stays on the same version throughout their session.
---
**See Also:**
- [Auto Scaling Strategies](AutoScalingStrategies) — Handling the load shift.
- [Backwards Compatibility Strategies](BackwardsCompatibilityStrategies) — Managing schema drift.
- [Monitoring and Alerting](MonitoringAndAlerting) — The data source for ACA.