Cascading Failures: How One Slow Dependency Kills Everything

The blast radius of a single slow dependency

In a distributed system, services often call other services synchronously. The dangerous failure mode is not a dependency that returns errors quickly; it is one that becomes slow. A service typically has a bounded pool of workers: threads, connections, or a capped number of goroutines. When a downstream dependency slows from 20 ms to 5 s, each in-flight request holds its worker for 5 seconds instead of releasing it after 20 ms, so the same pool completes far fewer requests per second.
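
The arithmetic behind that slowdown can be sketched with Little's law (concurrency = throughput × latency). The pool size of 100 workers and the function name max_throughput are assumptions chosen for illustration; only the 20 ms and 5 s latencies come from the text above.

```python
def max_throughput(workers: int, latency_s: float) -> float:
    """Upper bound on requests/second for a pool where each request
    holds one worker for latency_s seconds (Little's law rearranged)."""
    return workers / latency_s


# Hypothetical pool of 100 workers.
healthy = max_throughput(workers=100, latency_s=0.020)   # about 5000 requests/second
degraded = max_throughput(workers=100, latency_s=5.0)    # 20 requests/second

print(f"healthy:  {healthy:.0f} req/s")
print(f"degraded: {degraded:.0f} req/s")
print(f"capacity lost: {healthy / degraded:.0f}x")
```

Under these assumptions the service loses roughly 250x of its capacity without a single request returning an error, which is why a slow dependency tends to be harder to spot than a failing one.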

The pile-up

  • Workers stay blocked waiting on the slow call, so the pool drains.
  • New requests queue up; with no cap on the queue, it keeps growing until memory runs out or callers time out.
  • Latency for every endpoint on that service spikes — even ones that never touch the slow dependency — because they are starved of workers.
  • The service now looks unhealthy to its callers, which begin to pile up the same way. The failure propagates upward through the call graph.
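
The pile-up described in the list above can be reproduced with a small discrete-time simulation. This is a toy model, not a measurement: the function name simulate_queue, the pool size of 10, and the arrival rate of 5 requests per tick are all assumptions made for illustration.

```python
def simulate_queue(workers: int, arrivals_per_tick: int,
                   hold_ticks: int, ticks: int) -> list[int]:
    """Return the queue length at the end of each tick for a fixed-size pool.

    Each request holds one worker for hold_ticks ticks. Requests that find
    no free worker wait in an unbounded queue.
    """
    in_flight: list[int] = []   # tick at which each busy worker is released
    queued = 0
    history = []
    for now in range(ticks):
        # Release workers whose request has finished.
        in_flight = [done for done in in_flight if done > now]
        queued += arrivals_per_tick
        # Start as many queued requests as there are free workers.
        started = min(workers - len(in_flight), queued)
        in_flight.extend([now + hold_ticks] * started)
        queued -= started
        history.append(queued)
    return history


# Same pool, same traffic; only the dependency latency changes.
healthy = simulate_queue(workers=10, arrivals_per_tick=5, hold_ticks=1, ticks=20)
degraded = simulate_queue(workers=10, arrivals_per_tick=5, hold_ticks=10, ticks=20)

print("healthy queue: ", healthy)
print("degraded queue:", degraded)
```

In the healthy run the queue stays empty. In the degraded run the pool is exhausted within two ticks and the queue keeps growing, because the pool can finish only about one request per tick while five arrive.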

This is a cascading failure: a localized slowdown turns into a system-wide outage. The root cause is almost always unbounded resource consumption while waiting. Retry storms amplify the problem: clients that time out and retry multiply the load on an already-struggling dependency, which can keep it from ever recovering.
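
How much retries can amplify load is easy to underestimate. A worst-case sketch, assuming every layer of the call chain retries independently and every attempt fails (the function name and the numbers are illustrative assumptions):

```python
def retry_amplification(attempts_per_layer: int, layers: int) -> int:
    """Worst-case number of calls reaching the bottom dependency for one
    user request, when each layer makes attempts_per_layer attempts
    (the original call plus its retries) and every attempt fails."""
    return attempts_per_layer ** layers


# 1 original call + 2 retries = 3 attempts at each of 3 layers.
print(retry_amplification(attempts_per_layer=3, layers=3))  # 27
```

Under these assumptions a single user request can turn into 27 calls against the dependency that is already struggling. This is one reason retries are usually paired with backoff and confined to a single layer of the stack.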

The core insight

You cannot prevent dependencies from failing — networks partition, GC pauses happen, databases lock up. Resilience is about containing failure: fail fast, isolate resources, and degrade gracefully instead of collapsing. The patterns in this module — circuit breakers, timeouts, bulkheads, retries with backoff, load shedding, and fallbacks — are the standard toolkit interviewers expect you to reach for.
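
As a first taste of "fail fast" and "degrade gracefully", here is a minimal sketch that combines a timeout with a fallback, using only Python's standard library. The names call_with_timeout, _dependency_pool, and slow_dependency are hypothetical, and the pool size and timeout values are arbitrary assumptions.

```python
import concurrent.futures
import time

# A small dedicated pool for one dependency (assumed size). Keeping it
# separate from the request-handling workers is a simple form of bulkhead.
_dependency_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s: float, fallback):
    """Run fn in the dependency pool; return fallback if it is too slow."""
    future = _dependency_pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # cancel() only prevents a call that has not started yet; a call
        # already running keeps its pool thread until it returns.
        future.cancel()
        return fallback


def slow_dependency() -> str:
    """Stand-in for a degraded downstream call (hypothetical)."""
    time.sleep(0.3)
    return "fresh"


print(call_with_timeout(lambda: "fresh", timeout_s=1.0, fallback="cached"))
print(call_with_timeout(slow_dependency, timeout_s=0.05, fallback="cached"))
```

The caller gets an answer within the timeout either way, so its own worker is released quickly. The sketch does not stop the slow call itself: a real client would also set a timeout on the underlying network call so that the dependency pool's threads are freed as well.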