Concept

Monitoring vs Observability

Why this matters

A single server with one process is easy to debug: you read the log file. A system of dozens of microservices, queues, caches, and databases is not. When a user reports that checkout is slow, the cause could sit at any hop along the request path. Observability is the property that lets you ask arbitrary questions about your system's internal state from the outside, using telemetry it already emits, without shipping new code to add a print statement.

The distinction

Monitoring answers questions you already knew to ask: Is CPU above 80%? Is the error rate above 1%? You define dashboards and alerts in advance for known failure modes. Observability lets you investigate unknown failure modes after the fact: Why are requests from this one customer, on this one API version, hitting this one shard, timing out? You did not predict that question, but the data is rich enough to answer it.
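The contrast can be sketched in a few lines of Python. This is a minimal illustration, not a real telemetry backend: the EVENTS list, its field names, and both helper functions are hypothetical. It assumes each request is recorded as one wide event carrying fields such as customer, API version, and shard.

```python
from collections import Counter

# Hypothetical wide events, one per request; field names are illustrative.
EVENTS = [
    {"customer": "acme", "api_version": "v2", "shard": 7, "duration_ms": 5200, "timed_out": True},
    {"customer": "acme", "api_version": "v2", "shard": 7, "duration_ms": 5100, "timed_out": True},
    {"customer": "acme", "api_version": "v1", "shard": 3, "duration_ms": 120, "timed_out": False},
    {"customer": "globex", "api_version": "v2", "shard": 7, "duration_ms": 140, "timed_out": False},
    {"customer": "globex", "api_version": "v2", "shard": 2, "duration_ms": 95, "timed_out": False},
    {"customer": "initech", "api_version": "v1", "shard": 3, "duration_ms": 110, "timed_out": False},
]


def timeout_rate_alert(events, threshold=0.01):
    """Monitoring: a yes/no answer to a question fixed in advance."""
    rate = sum(e["timed_out"] for e in events) / len(events)
    return rate > threshold


def timeouts_by(events, *dimensions):
    """Observability: group timeouts by any fields, chosen at query time."""
    counts = Counter(
        tuple(e[d] for d in dimensions) for e in events if e["timed_out"]
    )
    return counts.most_common()


alert = timeout_rate_alert(EVENTS)
breakdown = timeouts_by(EVENTS, "customer", "api_version", "shard")
print(alert)
print(breakdown)
```

The alert only says that the timeout rate is above the threshold. The grouped query, which nobody had to define in advance, shows that every timeout comes from one customer on one API version hitting one shard.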

Monitoring is one of the things an observable system enables, not a competing approach. The three telemetry signals that power observability (logs, metrics, and traces) are commonly called the three pillars. They are complementary: as a rule of thumb, metrics tell you something is wrong, traces tell you where, and logs tell you why.
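A rough sketch of how one request can feed all three signals, using only the Python standard library. The handle_checkout function, the in-memory metrics, spans, and logs stores, and the Span class are all hypothetical stand-ins for a real telemetry SDK. The point to notice is the shared trace_id, which is what lets you jump from a trace to its logs.

```python
import json
import time
import uuid
from collections import Counter
from dataclasses import dataclass

# In-memory stand-ins for real telemetry backends (assumption for illustration).
metrics = Counter()  # metrics: cheap aggregates, no per-request detail
spans = []           # traces: timed operations linked by trace_id
logs = []            # logs: structured records with full detail


@dataclass
class Span:
    trace_id: str
    name: str
    duration_ms: float


def handle_checkout(cart_total):
    """Hypothetical handler that emits a metric, a span, and (on error) a log."""
    trace_id = uuid.uuid4().hex
    start = time.perf_counter()
    status = "ok"
    try:
        if cart_total < 0:
            raise ValueError("negative cart total")
    except ValueError as exc:
        status = "error"
        # The log carries the trace_id so it can be joined to the span later.
        logs.append(json.dumps({
            "trace_id": trace_id,
            "level": "error",
            "msg": str(exc),
            "cart_total": cart_total,
        }))
    duration_ms = (time.perf_counter() - start) * 1000
    spans.append(Span(trace_id, "checkout", duration_ms))
    metrics[("checkout_requests", status)] += 1
    return trace_id


ok_id = handle_checkout(25.0)
bad_id = handle_checkout(-1.0)
```

After these two calls the metric shows one error out of two requests, the spans show how long each request took, and the single log line, found via bad_id, shows the bad input.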

The classic debugging loop

  1. Detect — a metric-based alert fires (latency or error budget burn).
  2. Triage — drill into metrics to find the affected service, region, or version.
  3. Localize — open a trace for a slow request to see which span is the bottleneck.
  4. Root-cause — read the structured logs for that span to see the exception or bad input.
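
The four steps above can be sketched as four small functions over in-memory telemetry. Everything here is hypothetical: the service names, trace ids, span names, log messages, and the 500 ms threshold are invented for illustration, and a plain mean-latency check stands in for a real latency or error-budget alert.

```python
# Hypothetical telemetry for one incident; every name and value is illustrative.
LATENCY_MS = {  # metrics: request latency samples per service
    "gateway": [40, 42, 45],
    "cart": [30, 31, 29],
    "payment": [900, 950, 1200],
}
TRACES = {  # traces: (span name, duration in ms) per trace id
    "t-1": {"service": "payment",
            "spans": [("auth", 20), ("charge_card", 880), ("receipt", 15)]},
    "t-2": {"service": "cart",
            "spans": [("load_cart", 25), ("price", 5)]},
}
LOGS = [  # logs: structured records keyed by trace id and span name
    {"trace_id": "t-1", "span": "charge_card", "level": "error",
     "msg": "card processor timeout, retrying"},
    {"trace_id": "t-1", "span": "auth", "level": "info", "msg": "token valid"},
]
THRESHOLD_MS = 500  # assumed alert threshold


def mean(samples):
    return sum(samples) / len(samples)


def detect(latency, threshold_ms):
    """1. Detect: does any service's mean latency breach the threshold?"""
    return any(mean(s) > threshold_ms for s in latency.values())


def triage(latency):
    """2. Triage: which service has the worst mean latency?"""
    return max(latency, key=lambda svc: mean(latency[svc]))


def localize(traces, service):
    """3. Localize: slowest span in the first trace from that service."""
    trace_id, trace = next(
        (tid, t) for tid, t in traces.items() if t["service"] == service
    )
    span_name, _ = max(trace["spans"], key=lambda span: span[1])
    return trace_id, span_name


def root_cause(logs, trace_id, span_name):
    """4. Root-cause: error logs attached to that trace and span."""
    return [
        r["msg"] for r in logs
        if r["trace_id"] == trace_id
        and r["span"] == span_name
        and r["level"] == "error"
    ]


if detect(LATENCY_MS, THRESHOLD_MS):
    service = triage(LATENCY_MS)
    trace_id, span_name = localize(TRACES, service)
    causes = root_cause(LOGS, trace_id, span_name)
    print(service, trace_id, span_name, causes)
```

Each step narrows the search using a different signal: metrics pick the service, the trace picks the span, and the logs for that span give the reason.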

A strong interview answer connects the pillars into this loop rather than listing them in isolation.