Concept

The RAM Metrics & Design Principles

Everything Fails All the Time

The foundational mindset shift in reliability engineering: production failure is not an anomalous event to be prevented — it is the normal steady state to be designed around. Hardware fails. Networks partition. Deployments introduce bugs. Dependencies become unavailable. The question is not "how do we prevent failure?" but "how do we build a system that continues working when components fail?"

The Three RAM Metrics

Reliability — MTBF (Mean Time Between Failures): How often does the system fail? A higher MTBF means the system fails less frequently. Improving MTBF requires better engineering: more robust code, more redundant hardware, better monitoring to catch problems before they become failures.

Availability — MTBF / (MTBF + MTTR): The fraction of time the system is operational. A system with an MTBF of 1 week and an MTTR of 1 minute (about 99.990% available) has higher availability than one with an MTBF of 1 year and an MTTR of 6 hours (about 99.932%). Key insight: reducing MTTR is often more impactful than increasing MTBF.
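The comparison above can be checked with a few lines of arithmetic. This is a minimal sketch: the function name `availability` and the hour-based constants are illustrative choices, not part of any standard library, and a year is assumed to be 365 days.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTBF / (MTBF + MTTR), both in hours."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


HOURS_PER_WEEK = 7 * 24
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year

# Fails every week, but recovers in one minute.
frequent_fast = availability(HOURS_PER_WEEK, 1 / 60)
# Fails once a year, but takes six hours to recover.
rare_slow = availability(HOURS_PER_YEAR, 6)

for label, a in [("weekly failure, 1 min MTTR", frequent_fast),
                 ("yearly failure, 6 h MTTR", rare_slow)]:
    downtime_h = (1 - a) * HOURS_PER_YEAR  # expected downtime per year
    print(f"{label}: {a:.4%} available, ~{downtime_h:.2f} h down per year")
```

Under these assumptions, the system that fails 52 times a year accumulates under an hour of downtime, while the system that fails once accumulates about six hours.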

Maintainability — MTTR (Mean Time To Recovery): How quickly can the system recover after a failure? MTTR is the sum of detection time, diagnosis time, and repair time, so shortening any one phase shortens recovery. Robust alerting cuts detection time, runbooks cut diagnosis time, and automated failover cuts repair time.
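To see which phase is worth attacking first, MTTR can be broken down per phase from incident records. The sketch below assumes a hypothetical `Incident` record with per-phase durations in minutes; the field names and the sample numbers are invented for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; all durations are in minutes."""
    detect_min: float
    diagnose_min: float
    repair_min: float

    @property
    def recovery_min(self) -> float:
        # Time to recovery for one incident is the sum of its phases.
        return self.detect_min + self.diagnose_min + self.repair_min


def mttr(incidents: list[Incident]) -> float:
    """Mean time to recovery across incidents, in minutes."""
    return mean(i.recovery_min for i in incidents)


def phase_means(incidents: list[Incident]) -> dict[str, float]:
    """Average minutes spent in each phase; the values add up to the MTTR."""
    return {
        "detect": mean(i.detect_min for i in incidents),
        "diagnose": mean(i.diagnose_min for i in incidents),
        "repair": mean(i.repair_min for i in incidents),
    }


# Invented sample data, not real measurements.
incidents = [Incident(20, 30, 10), Incident(5, 40, 15), Incident(35, 10, 5)]
print(f"MTTR: {mttr(incidents):.1f} min")
print(phase_means(incidents))
```

In this made-up sample, diagnosis is the largest contributor, which would point toward better runbooks or observability rather than faster failover.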

Design Principles

  • Eliminate SPOFs through redundancy: Identify every single point of failure in the system — single database primary with no replica, single load balancer, single DNS provider — and add redundancy at each.
  • Enable graceful degradation: Classify all dependencies as critical (system cannot function without it) vs non-critical (system continues in degraded mode without it). Non-critical dependency failures must never propagate as hard errors to users.
  • Monitor proactively: Monitor p99/p99.9 latency, error rates, queue depths, and database connection pool saturation, not just CPU and memory. Set alerts on leading indicators (for example, a queue depth or connection pool trending toward its limit), not only on symptoms that users already see.
  • Automate failover: Detected failures should trigger remediation without human intervention. Kubernetes restarts failed pods, Patroni or AWS RDS Multi-AZ promotes a standby database, and load balancer health checks take unhealthy instances out of rotation.
  • Implement circuit breakers and bounded retries: Prevent cascading failures by stopping calls to unhealthy downstream services and limiting the retry storm that follows a failure.
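The last principle, combined with graceful degradation, can be sketched as follows. This is a simplified illustration, not a production implementation: the names `CircuitBreaker`, `call_with_retries`, and `with_fallback`, the thresholds, and the timeouts are all assumptions chosen for the example. A real system would more likely use an established resilience library and add jitter to the backoff.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self._opened_at = None
        return result


def call_with_retries(breaker: CircuitBreaker, fn: Callable[[], T],
                      max_attempts: int = 3, base_delay_s: float = 0.1,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry at most max_attempts times with exponential backoff."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return breaker.call(fn)
        except CircuitOpenError:
            raise  # never retry against an open circuit
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay_s * 2 ** attempt)
    assert last_error is not None
    raise last_error


def with_fallback(fn: Callable[[], T], default: T) -> T:
    """Graceful degradation for a non-critical dependency."""
    try:
        return fn()
    except Exception:
        return default


if __name__ == "__main__":
    def flaky() -> list[str]:  # hypothetical downstream call
        raise ConnectionError("recommendation service unavailable")

    breaker = CircuitBreaker()
    items = with_fallback(
        lambda: call_with_retries(breaker, flaky, sleep=lambda _: None), [])
    print("recommendations:", items)  # page still renders without them
```

The breaker stops sending traffic to a dependency that keeps failing, the retry helper caps how many extra calls a single request can generate, and the fallback keeps a non-critical failure from surfacing as a hard error to the user.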