
Chaos Engineering

Proactively test system resilience: chaos experiments, blast radius control, Netflix Chaos Monkey, Gremlin, and building a chaos engineering practice.

10 min read

From Reactive to Proactive Reliability

Traditional reliability engineering is reactive: wait for production incidents, learn from post-mortems, fix the root cause. Chaos engineering inverts this: deliberately introduce failure in a controlled way to discover weaknesses before they become real outages. The motto is: *'If it hurts, do it more often.'* A system that is never tested under failure is a system whose failure behavior is unknown.

ℹ️

Chaos Engineering vs Stress Testing

Stress testing pushes a system beyond its stated capacity (e.g., 10x normal traffic). Chaos engineering tests the system's resilience to specific failure modes at normal or slightly elevated load — instance crashes, network partitions, dependency failures, disk filling up. They are complementary, not competing.

The Chaos Engineering Process

Chaos engineering cycle: define steady state, form hypothesis, run experiment, analyze, fix.
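The cycle in the diagram can be sketched as a simple experiment loop. This is a minimal sketch, not any tool's API: the `inject_failure`, `stop_failure`, and `read_metrics` callbacks are hypothetical stand-ins, and the steady-state thresholds are the ones used later in this lesson.

```python
import time

def steady_state_ok(metrics):
    # Steady state from this lesson: p99 latency < 300ms, error rate < 0.5%
    return metrics["p99_latency_ms"] < 300 and metrics["error_rate"] < 0.005

def run_experiment(inject_failure, stop_failure, read_metrics,
                   duration_s=60, poll_s=5):
    """One pass through the cycle: confirm steady state, inject the failure,
    watch for deviation, and abort (kill switch) if steady state is violated."""
    if not steady_state_ok(read_metrics()):
        return "aborted: not in steady state before injection"
    inject_failure()
    try:
        deadline = time.time() + duration_s
        while time.time() < deadline:
            if not steady_state_ok(read_metrics()):
                return "hypothesis rejected: steady state violated"
            time.sleep(poll_s)
        return "hypothesis confirmed: steady state held"
    finally:
        stop_failure()  # the kill switch always removes the injected fault
```

Note the `finally` block: cleanup must run whether the hypothesis holds or not, so an aborted experiment never leaves the fault injected.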

Types of Chaos Experiments

| Failure Type | Example Experiment | Tests For |
| --- | --- | --- |
| Instance/Pod termination | Kill a random API pod every hour | Auto-healing, replica recovery |
| Network latency | Add 500ms latency to payment service calls | Timeout handling, circuit breaking |
| Packet loss | Drop 20% of packets between services | Retry logic, graceful degradation |
| Disk full | Fill the data volume to 100% | Error handling, alerting |
| Dependency unavailability | Take the cache (Redis) offline | Fallback to database, error messages |
| CPU / memory stress | Peg CPU at 95% on a database node | Autoscaling, replication failover |
| DNS failure | Return NXDOMAIN for a dependent service | Connection timeout handling |
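The dependency-unavailability experiment is often the first one teams run. A minimal sketch of the fallback behavior it verifies (the `get_user` function and in-memory stand-ins for Redis and the database are hypothetical):

```python
class CacheDown(Exception):
    """Raised by the cache client when Redis is unreachable."""

def get_user(user_id, cache, db):
    """Read-through lookup that degrades gracefully when the cache is offline."""
    try:
        value = cache.get(user_id)
        if value is not None:
            return value
    except CacheDown:
        pass  # cache offline: fall back to the database instead of failing
    value = db[user_id]  # authoritative source of truth
    try:
        cache.set(user_id, value)
    except CacheDown:
        pass  # repopulation is best-effort only
    return value
```

Taking Redis offline in the experiment then confirms two things at once: reads still succeed (via the database) and the failed repopulation does not surface as a user-facing error.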

Netflix's Simian Army

Netflix pioneered chaos engineering and open-sourced a suite of tools called the Simian Army. Chaos Monkey terminates random EC2 instances during business hours, forcing engineers to design for instance loss. Chaos Gorilla simulates the failure of an entire AWS availability zone. Latency Monkey introduces artificial delays. Conformity Monkey checks instances against best-practice rules. Security Monkey finds security policy violations.

Netflix runs Chaos Monkey in production during business hours — when engineers are at their desks to respond. The philosophy: if you're going to experience failure anyway (and you will), better to experience it on your schedule with your best engineers available.

Tools: Gremlin and Chaos Mesh

| Tool | Type | Key Features |
| --- | --- | --- |
| Chaos Monkey | Open-source (Netflix) | EC2 instance termination, Spinnaker integration |
| Gremlin | Commercial SaaS | GUI, attack library, blast radius controls, GameDay planning |
| Chaos Mesh | Open-source (CNCF) | Kubernetes-native, pod/network/time chaos, web UI |
| LitmusChaos | Open-source (CNCF) | Kubernetes-native, ChaosHub with pre-built experiments |
| Pumba | Open-source | Docker container chaos: kill, pause, network |

Prerequisites for Safe Chaos Engineering

  1. Strong observability — you need metrics, logs, and traces to define steady state and detect deviation
  2. Defined steady state — concrete, measurable success criteria (e.g., p99 latency < 300ms, error rate < 0.5%)
  3. Automated kill switch — ability to abort the experiment instantly if steady state is violated
  4. Runbooks — engineers know how to respond to the failure mode being tested
  5. Start small — begin in staging, limit blast radius (canary percentage or specific namespace), then graduate to production
  6. GameDays — scheduled chaos exercises where teams practice responding to failures together
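Prerequisite 5 (limiting blast radius) can be sketched as target selection before injection. This is an illustrative helper, not any tool's API; the percentage and seeding are assumptions:

```python
import random

def pick_blast_radius(instances, percent=5, seed=None):
    """Limit blast radius: attack at most `percent` of instances,
    but always at least one so the experiment actually runs."""
    rng = random.Random(seed)  # seedable for reproducible GameDays
    k = max(1, len(instances) * percent // 100)
    return rng.sample(instances, k)
```

Starting at a small percentage in staging and raising it only after the hypothesis holds is what "graduate to production" means in practice.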
📌

Example: Chaos Mesh network latency experiment

Inject 200ms latency on all traffic from the `checkout-service` to the `payment-service`. Hypothesis: the checkout service's 500ms timeout will trigger a circuit breaker, and users will see a graceful 'payment unavailable, try again' message rather than a hanging request. Expected: error rate stays below 2%, no 5xx responses escape to the client. Abort condition: error rate exceeds 5%.
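This experiment could be expressed as a Chaos Mesh `NetworkChaos` manifest along these lines. The namespace and `app` labels are assumptions about your deployment; field names follow the `chaos-mesh.org/v1alpha1` schema:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: checkout-to-payment-latency
  namespace: shop              # assumed namespace
spec:
  action: delay
  mode: all                    # affect every matching checkout pod
  selector:
    labelSelectors:
      app: checkout-service    # assumed pod label
  direction: to
  target:
    mode: all
    selector:
      labelSelectors:
        app: payment-service   # assumed pod label
  delay:
    latency: "200ms"
  duration: "10m"              # bounded experiment window
```

The `duration` field bounds the experiment window; the abort condition (error rate above 5%) would be enforced by whatever monitoring drives your kill switch, not by the manifest itself.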

💡

Interview Tip

If asked 'how would you improve system reliability beyond standard redundancy?' — describe chaos engineering. Key points: (1) define steady state with SLOs, (2) form a hypothesis about a specific failure mode, (3) limit blast radius (start in staging), (4) inject failure with automated rollback if steady state is violated, (5) fix weaknesses discovered. Mention that chaos engineering requires strong observability as a prerequisite — this shows architectural depth.
