Canary Release
Gradually roll out changes to a subset of users: traffic splitting, automated rollback triggers, metrics-based promotion, and A/B testing integration.
What Is a Canary Release?
A canary release gradually routes a small percentage of production traffic to the new version while the majority of users continue on the stable version. The name comes from the 'canary in a coal mine' metaphor — the canary detects danger early so miners can escape. In deployment terms, if the new version misbehaves under real traffic, only a small fraction of users are affected before the system automatically rolls back.
Unlike blue-green (which is a binary all-or-nothing switch), canary is a progressive traffic shift: 1% → 5% → 25% → 50% → 100%. At each stage, metrics are evaluated. If they stay within bounds, the deployment advances. If they degrade, the canary is automatically killed and traffic returns to 100% on the stable version.
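The staged shift with automatic rollback can be sketched as a simple control loop. This is a hypothetical sketch, not any specific tool's implementation: `check_metrics` and `set_traffic_weight` are stand-ins for querying your observability stack and updating the router.

```python
# Hypothetical canary promotion loop: advance through traffic stages,
# roll back to 0% canary on the first failed metric check.

STAGES = [1, 5, 25, 50, 100]  # percent of traffic sent to the canary

def run_canary(check_metrics, set_traffic_weight):
    """Progressively shift traffic to the canary.

    check_metrics(weight) -> bool: True if canary metrics are within bounds.
    set_traffic_weight(weight): route `weight` percent of traffic to the canary.
    Returns "promoted" if all stages pass, "rolled_back" otherwise.
    """
    for weight in STAGES:
        set_traffic_weight(weight)
        if not check_metrics(weight):
            set_traffic_weight(0)  # kill the canary; all traffic back to stable
            return "rolled_back"
    return "promoted"
```

Real controllers (Flagger, Argo Rollouts) add a hold period per stage and evaluate metrics repeatedly before advancing, but the shape of the loop is the same.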
Traffic Splitting Strategies
There are several ways to route a percentage of traffic to the canary:
| Strategy | How It Works | Consistency | Tool Examples |
|---|---|---|---|
| Random weight | Load balancer sends X% of requests to canary | None — same user may hit both versions | AWS ALB weighted target groups, Nginx upstream |
| User cohort (sticky) | Hash user ID to always route same user to same version | High — consistent user experience | Istio VirtualService, Envoy |
| Header-based | Route requests with specific header to canary | Manual — only users/testers with the header | Nginx, API Gateway, Feature flag |
| Geography / segment | Route users in a specific region or segment | High within segment | CloudFront, Akamai, Istio |
Sticky Sessions for Canary
For user-facing features, use consistent hashing on the user ID to ensure the same user always hits the same version during the canary window. Mixing versions mid-session creates a confusing UX and makes bug reports harder to diagnose.
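The consistent-hashing idea above can be shown in a few lines. This is an illustrative sketch (the function name and bucket scheme are made up, not a library API): hash the user ID into a stable bucket in [0, 100) and compare against the canary percentage, so the same user always lands on the same version.

```python
import hashlib

def route_version(user_id: str, canary_percent: int) -> str:
    """Deterministically bucket a user into 'canary' or 'stable'.

    Hashing the user ID (rather than picking randomly per request)
    guarantees a user sees one version for the whole canary window.
    """
    # Map the user ID to a stable bucket in [0, 100).
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "canary" if bucket < canary_percent else "stable"
```

As `canary_percent` grows from 1 to 100, users already routed to the canary stay there (their bucket is still below the threshold), so promotion only ever moves users from stable to canary, never back and forth.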
Automated Promotion and Rollback
The power of modern canary tooling is automated, metric-driven promotion. You define success criteria up front — typically comparing the canary's metrics against a baseline sample of stable traffic over the same time window. Tools like Flagger (Kubernetes) and AWS CodeDeploy with CloudWatch alarms implement this automatically.
- Error rate: Canary HTTP 5xx rate must not exceed stable rate by more than 1%
- Latency: Canary p99 latency must not exceed stable p99 by more than 50ms
- Business metric: Canary checkout conversion must not drop more than 2% vs baseline
- Custom metrics: Any metric from your observability stack (Datadog, Prometheus) can be a gate
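The gates above can be expressed as a small predicate. This is a hedged sketch using the example thresholds from the list (the dict keys and numbers are assumptions for illustration); the key point is that each check compares the canary against a baseline sample of stable traffic, not against an absolute limit.

```python
def gates_pass(canary: dict, baseline: dict) -> bool:
    """Evaluate canary metrics against a baseline sample of stable traffic.

    Both dicts carry the same keys. Comparing relative to the baseline
    (rather than to fixed limits) controls for ambient conditions such
    as a site-wide traffic spike affecting both versions equally.
    """
    checks = [
        # Error rate may not exceed stable by more than 1 percentage point.
        canary["error_rate"] <= baseline["error_rate"] + 0.01,
        # p99 latency may not exceed stable p99 by more than 50 ms.
        canary["p99_ms"] <= baseline["p99_ms"] + 50,
        # Checkout conversion may not drop more than 2% vs baseline.
        canary["conversion"] >= baseline["conversion"] * 0.98,
    ]
    return all(checks)
```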
```yaml
# Flagger Canary resource (Kubernetes / Istio)
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: payment-service
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  progressDeadlineSeconds: 3600
  service:
    port: 8080
  analysis:
    interval: 2m      # evaluate every 2 minutes
    threshold: 5      # max 5 failed checks before rollback
    maxWeight: 50     # never exceed 50% canary traffic
    stepWeight: 10    # increase by 10% each interval
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99     # must be >= 99%
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500    # p99 must be <= 500ms
        interval: 1m
```
Canary vs A/B Testing
Canary release and A/B testing look similar — both route traffic to different versions — but they serve different purposes:
| Dimension | Canary Release | A/B Testing |
|---|---|---|
| Goal | Validate reliability/stability of new code | Measure user behavior and business metrics |
| Decision driver | Technical metrics (errors, latency) | Business metrics (conversion, engagement) |
| Duration | Hours to days | Days to weeks |
| Traffic split | Small % initially (1–5%) | Often 50/50 for statistical significance |
| Rollback trigger | Automatic (metric threshold) | Manual (business decision) |
| Team | Engineering / SRE | Product / Data Science |
In practice, teams often combine canary and A/B: the canary gates on technical metrics first. Once the canary is at 100%, a separate A/B experiment measures business impact of the new feature's behavior.
Kubernetes Implementation
On Kubernetes without a service mesh, a basic canary is done by running two `Deployment` objects with different replica counts behind a shared `Service`. If the stable deployment has 9 replicas and the canary has 1, roughly 10% of traffic goes to the canary. The limitation is that the traffic percentage can only be controlled in multiples of `1/total_replicas`.
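The replica-count granularity limit is easy to quantify. A minimal sketch (the function name is hypothetical) computing the approximate traffic share a replica-count canary receives:

```python
from fractions import Fraction

def canary_traffic_share(stable_replicas: int, canary_replicas: int) -> Fraction:
    """Approximate traffic share of a replica-count-based canary.

    With a shared Service distributing requests roughly evenly across
    pods, traffic splits in proportion to replica counts, so achievable
    percentages are multiples of 1/total_replicas.
    """
    total = stable_replicas + canary_replicas
    return Fraction(canary_replicas, total)
```

With 9 stable replicas, the smallest achievable canary share is 1/10; getting to a 1% canary this way would require running about 99 stable replicas, which is why mesh-level routing is preferred for fine-grained splits.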
With Istio or Linkerd service meshes, you get precise percentage-based routing at the service mesh level independent of replica counts — a canary with 1 replica can receive exactly 5% of traffic while 19 stable replicas receive 95%. This is the preferred approach in production.
Canary and Database Migrations
Like blue-green, canary deployments share a database between versions. During a canary, both v1.0 and v2.0 are actively writing to the same database simultaneously. Any schema change must be fully backward-compatible with v1.0. Use the expand-contract pattern and never drop columns or change column types while a canary is active.
Interview Tip
When asked 'how do you deploy safely?', canary is a great answer. Demonstrate depth by explaining the metric-based promotion gates — don't just say 'gradually increase traffic.' Mention the comparison against a baseline (not just absolute thresholds), and note that Flagger/Argo Rollouts automate this in Kubernetes. Bonus points: mention that canary and feature flags solve different problems and are often used together.