Concept

The Circuit Breaker State Machine

The Problem Circuit Breakers Solve

When a downstream service becomes slow or unavailable, callers that keep sending requests accumulate blocked threads. Each blocked thread holds resources: memory, connections, CPU time. As more requests arrive and wait on the slow downstream, the caller's thread pool is exhausted. The caller can then no longer serve any requests, including those unrelated to the failing downstream. The failure has cascaded upstream.

A circuit breaker detects this degradation early and blocks calls to the failing service, giving it time to recover while protecting the caller's resource budget.

The Three States

CLOSED (normal operation): All requests pass through to the downstream service. The circuit breaker measures the failure rate (errors plus timeouts) over recent calls and compares it against a configurable threshold.
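The failure-rate tracking described above can be sketched with a count-based sliding window. This is only an illustration under assumptions: the class name FailureRateWindow, the parameters size and min_calls, and the choice of a count-based (rather than time-based) window are hypothetical and not taken from any particular library.

```python
from collections import deque


class FailureRateWindow:
    """Count-based sliding window over the most recent call outcomes.

    Hypothetical helper for illustration only.
    """

    def __init__(self, size=10, min_calls=5):
        # deque(maxlen=...) silently drops the oldest outcome when full.
        self.outcomes = deque(maxlen=size)  # True = failure, False = success
        self.min_calls = min_calls

    def record(self, failed):
        """Record one finished call (an error or a timeout counts as failed)."""
        self.outcomes.append(bool(failed))

    def failure_rate(self):
        """Fraction of failed calls in the window, 0.0 when it is empty."""
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_trip(self, threshold=0.5):
        """True when the failure rate exceeds the threshold."""
        # With too few samples, a single early failure should not trip.
        if len(self.outcomes) < self.min_calls:
            return False
        return self.failure_rate() > threshold


window = FailureRateWindow(size=4, min_calls=4)
for failed in (False, True, True, True):
    window.record(failed)
print(window.failure_rate())     # 0.75
print(window.should_trip(0.5))   # True
```

The min_calls guard is one way to avoid opening the circuit on the very first failed request after startup, when one failure would otherwise read as a 100% failure rate.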

OPEN (blocking): When the failure rate exceeds the threshold, the circuit breaker opens. All subsequent calls are rejected immediately, without attempting the network hop. The caller receives an immediate error or a fallback response, and the downstream gets breathing room to recover. A reset timeout (the wait before the breaker starts probing again) begins counting down.

HALF-OPEN (probing): After the reset timeout expires, the circuit breaker allows a limited number of probe requests through. If they succeed, the circuit closes and normal operation resumes. If they fail, the circuit reopens and the reset timeout starts over.
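The three states and their transitions can be sketched as a small state machine. This is a minimal, single-threaded illustration, not a production implementation. All names here (CircuitBreaker, CircuitOpenError, reset_timeout, probe_successes, the injectable clock) are hypothetical, and the sketch assumes that every raised exception counts as a failure and that a single failed probe reopens the circuit.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the downstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=0.5, window_size=10,
                 reset_timeout=30.0, probe_successes=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_size = window_size
        self.reset_timeout = reset_timeout      # seconds spent in OPEN
        self.probe_successes = probe_successes  # successes needed to close
        self.clock = clock                      # injectable for testing
        self.state = State.CLOSED
        self.outcomes = []                      # recent results, True = failure
        self.opened_at = 0.0
        self.probes_ok = 0

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open")
            # Reset timeout expired: let probe requests through.
            self.state = State.HALF_OPEN
            self.probes_ok = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state is State.HALF_OPEN:
            self.probes_ok += 1
            if self.probes_ok >= self.probe_successes:
                self.state = State.CLOSED       # recovery confirmed
                self.outcomes = []
        else:
            self._record(False)

    def _on_failure(self):
        if self.state is State.HALF_OPEN:
            self._open()                        # failed probe: reopen
            return
        self._record(True)
        window_full = len(self.outcomes) == self.window_size
        if window_full and sum(self.outcomes) / self.window_size > self.failure_threshold:
            self._open()

    def _record(self, failed):
        self.outcomes.append(failed)
        self.outcomes = self.outcomes[-self.window_size:]

    def _open(self):
        self.state = State.OPEN
        self.opened_at = self.clock()           # restart the reset timeout
```

A caller would wrap each downstream call, for example breaker.call(fetch_profile, user_id) with a hypothetical fetch_profile function, and handle CircuitOpenError by returning a fallback. A real implementation would also need locking for concurrent callers and a cap on how many probes run at once in HALF-OPEN.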

Implementation Guidance

  • Set thresholds based on production data. The error rate threshold and timeout window should reflect real-world baselines, not arbitrary defaults. A breaker that is too sensitive trips on normal noise; one that is too loose lets failures cascade before it fires.
  • Always provide a fallback. When the circuit is open, return something useful: a cached result, a default response, a degraded experience. Never fail silently.
  • Don't wrap everything. In-process function calls don't need circuit breakers. Only wrap external network calls — database queries, downstream HTTP services, message broker connections.
  • Trip on system faults, not business errors. A 404 Not Found or 400 Bad Request is an expected response, so don't count it as a circuit-breaker failure. Trip on timeouts, refused connections, and server errors such as 500 and 503, which signal that the downstream is struggling.
  • Use a library. Don't implement circuit breakers from scratch. Resilience4j (Java/Kotlin), Polly (.NET), and service meshes (Istio, Linkerd) provide battle-tested, richly configurable implementations.
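Two of the guidelines above, classifying failures and always providing a fallback, can be sketched in a few lines. The function names counts_as_failure and fetch_with_fallback and the CircuitOpenError exception are hypothetical, and treating every status code of 500 or above as a system fault is an assumption that goes slightly beyond the 500/503 examples in the text. In real code, the equivalent hooks of the chosen library would take the place of these helpers.

```python
class CircuitOpenError(Exception):
    """Hypothetical error a breaker raises when it rejects a call."""


def counts_as_failure(status_code=None, error=None):
    """Decide whether one call outcome should count toward the failure rate."""
    if error is not None:
        # Timeouts and connection problems (including refused connections)
        # are system faults; other exceptions are not counted here.
        return isinstance(error, (TimeoutError, ConnectionError))
    # Assumption: any 5xx status is a system fault. 4xx responses such as
    # 400 or 404 are business errors and never count.
    return status_code is not None and status_code >= 500


def fetch_with_fallback(fetch, cached):
    """Return (value, source) so the caller can tell a fallback was used."""
    try:
        return fetch(), "live"
    except CircuitOpenError:
        # Circuit is open: serve the cached value instead of failing.
        return cached, "fallback"


print(counts_as_failure(status_code=404))       # False
print(counts_as_failure(status_code=503))       # True
print(counts_as_failure(error=TimeoutError()))  # True
```

Returning the source alongside the value is one way to keep the degradation visible: the caller can log it or mark the response as stale rather than serving cached data silently.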