Concept

The Two Fundamental Axes of Performance

Latency: The Time to Complete One Operation

Latency measures the delay a single user or request experiences: the time from sending a request to receiving the response. It is measured in milliseconds (ms) or microseconds (μs). High latency means users wait, and in interactive systems latency directly determines perceived quality.

Latency is typically measured at percentiles — p50 (median), p99 (99th percentile), p99.9. The p99 latency is often more operationally important than the median, because the worst 1% of requests are what users complain about and what SLAs are often written against.
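To make the percentile idea concrete, here is a minimal sketch that computes p50 and p99 with the nearest-rank method. The sample data and the percentile helper are hypothetical and chosen only for illustration; production monitoring systems often use interpolating or approximate estimators instead.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]

# Hypothetical latencies in ms: 98 fast requests and 2 slow ones.
latencies_ms = [10.0] * 98 + [500.0] * 2

p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
mean_ms = sum(latencies_ms) / len(latencies_ms)

print(f"p50={p50_ms} ms  p99={p99_ms} ms  mean={mean_ms:.1f} ms")
```

In this made-up sample the median and the mean both look healthy, while p99 exposes the slow tail. That is why the tail percentile is usually the number to watch.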

Throughput: Operations Per Unit of Time

Throughput measures the total capacity of the system: how many operations it can complete per unit of time, measured in requests per second (RPS), messages per second, or MB/s. High throughput means the system can handle more total work. Throughput determines whether a system can keep up with the aggregate load of, say, 1 million active users.
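The following sketch shows that throughput and latency are separate axes. It uses an idealized model in which a fully loaded pool of identical workers, with no contention or coordination cost, completes workers / service_time operations per second. The function name and the numbers are assumptions made for illustration, and real systems fall short of this upper bound.

```python
def max_throughput_rps(workers, service_time_s):
    """Idealized upper bound on throughput: each busy worker completes
    one operation every service_time_s seconds, with no contention."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    if service_time_s <= 0:
        raise ValueError("service_time_s must be positive")
    return workers / service_time_s

# Hypothetical service: every operation takes 10 ms regardless of pool size.
single_worker_rps = max_throughput_rps(workers=1, service_time_s=0.010)
eight_worker_rps = max_throughput_rps(workers=8, service_time_s=0.010)

print(f"1 worker:  {single_worker_rps:.0f} ops/s")
print(f"8 workers: {eight_worker_rps:.0f} ops/s")
```

Under these assumptions each request still takes 10 ms, yet the larger pool completes eight times as many requests per second: latency is unchanged while throughput scales.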

The Fundamental Tension

Techniques that improve throughput often hurt latency, and vice versa. This tension is the central trade-off of performance engineering.

  • Batching increases throughput but hurts latency. Instead of processing each request immediately, batch multiple requests together and process them in one operation. The per-request overhead is amortized — total throughput rises. But individual requests must wait for the batch to fill before being processed — latency increases.
  • Dedicated resources minimize latency but hurt throughput. Giving each request its own dedicated thread, connection, or compute resource minimizes waiting, so requests are served immediately. But each dedicated resource carries a fixed cost and sits idle whenever its request is not using it, so the throughput achievable from a given hardware budget decreases.
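The batching trade-off above can be sketched with a simple cost model. It assumes each batch pays a fixed overhead plus a per-item cost, requests arrive evenly spaced, a batch is dispatched only once it is full, and every item in a batch completes when the whole batch finishes. All names and numbers are hypothetical; the model also ignores queueing, so it only holds when the arrival rate is below the computed capacity.

```python
def batch_throughput_rps(batch_size, overhead_s, per_item_s):
    """Saturated capacity: items completed per second when batches
    run back to back."""
    return batch_size / (overhead_s + batch_size * per_item_s)

def mean_batch_latency_s(batch_size, overhead_s, per_item_s, arrival_rate_rps):
    """Mean per-request latency: average wait for the batch to fill,
    plus the time to process the whole batch."""
    # With evenly spaced arrivals, the first item waits (B - 1) / rate and
    # the last item waits 0, so the mean fill wait is (B - 1) / (2 * rate).
    fill_wait_s = (batch_size - 1) / (2 * arrival_rate_rps)
    processing_s = overhead_s + batch_size * per_item_s
    return fill_wait_s + processing_s

# Hypothetical costs: 5 ms fixed overhead per batch, 0.1 ms per item,
# with requests arriving at 100 per second.
OVERHEAD_S = 0.005
PER_ITEM_S = 0.0001
ARRIVAL_RPS = 100

for size in (1, 10, 50):
    capacity = batch_throughput_rps(size, OVERHEAD_S, PER_ITEM_S)
    latency = mean_batch_latency_s(size, OVERHEAD_S, PER_ITEM_S, ARRIVAL_RPS)
    print(f"batch={size:>2}  capacity={capacity:7.0f} ops/s  "
          f"mean latency={latency * 1000:6.1f} ms")
```

With these assumed numbers, going from a batch size of 1 to 50 raises capacity from roughly 196 to 5,000 operations per second, while mean latency grows from about 5 ms to about 255 ms, almost all of it spent waiting for the batch to fill. Real batchers usually add a maximum wait timeout to cap that fill delay.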

The key skill is knowing which axis actually matters for your workload, and not optimizing the wrong one.