HN
Today

Alice is impatient

Marc Brooker explains the 'inspection paradox,' a statistical phenomenon that accounts for why users often perceive services as slower or less reliable than system-reported averages suggest. The concept clarifies the disconnect between event-based metrics and time-based user experience, and explains why tail latency disproportionately affects customer satisfaction. It is a useful insight for anyone building or monitoring distributed systems.

Score: 18
Comments: 1
Highest Rank: #2
Time on Front Page: 19h
First Seen: Jun 20, 10:00 PM
Last Seen: Jun 21, 4:00 PM
[Chart: Rank Over Time]

The Lowdown

This post examines a common but often misunderstood discrepancy in system performance: why users perceive services as much slower, or as prone to longer outages, than internal metrics suggest. Author Marc Brooker introduces 'Alice' and 'Alex,' who experience a service's mean request time or mean time to recovery (MTTR) as significantly longer than what the engineering team measures. The core of the issue is the 'inspection paradox,' a statistical effect in which observing a process at a random moment in time is more likely to land inside a long-duration event than a short one.

  • The fundamental disconnect arises because system metrics typically measure averages per request or per outage event, treating each instance equally.
  • Conversely, users (like Alice and Alex) experience time continuously, meaning they are disproportionately exposed to, and thus heavily weight, the longer requests or outages.
  • Technically, while the service might report a mean latency of E[T], the customer experiences a time-weighted average, which is never lower and typically much higher, given by E[T^2] / E[T] (where T is the request or outage duration).
  • A simulation demonstrates this effect: a service with a 30ms median and 600ms 99th percentile latency might report a mean of 254ms, but customers would experience an average of 410ms.
  • This paradox underscores why tail latency (e.g., p99 or p99.9) and long recovery times are far more critical to customer experience than overall averages, as they dominate user-perceived wait times.
  • Brooker also argues against 'trimmed measurements' for latency or recovery times, as they deliberately discard the critical tail data that most impacts user perception.
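
The gap between the two averages can be sketched in a few lines of Python. This is a minimal illustration, not the simulation from Brooker's post: the lognormal latency distribution, its parameters, the back-to-back request timeline, and all function names are assumptions made here, so the printed figures will not match the 254ms and 410ms quoted above. It compares the per-request mean, the duration-weighted mean (the sample form of E[T^2] / E[T]), a direct estimate obtained by dropping observers at random instants, and a trimmed mean that discards the slowest 1% of requests.

```python
import bisect
import math
import random


def event_mean(durations):
    """Per-request mean: every request counts once."""
    return sum(durations) / len(durations)


def time_weighted_mean(durations):
    """Duration-weighted mean: sum(t*t) / sum(t), the sample form of E[T^2] / E[T]."""
    return sum(t * t for t in durations) / sum(durations)


def sampled_experience(durations, observers, rng):
    """Drop observers at uniformly random instants on a timeline of
    back-to-back requests (an assumption of this sketch) and average
    the duration of the request each observer lands in."""
    ends = []
    total = 0.0
    for t in durations:
        total += t
        ends.append(total)
    seen = 0.0
    for _ in range(observers):
        instant = rng.random() * total
        # First request whose end time lies strictly after the instant.
        idx = min(bisect.bisect_right(ends, instant), len(durations) - 1)
        seen += durations[idx]
    return seen / observers


def trimmed_mean(durations, keep=0.99):
    """Mean of the fastest `keep` fraction; the slow tail is thrown away."""
    ordered = sorted(durations)
    kept = ordered[: max(1, int(len(ordered) * keep))]
    return sum(kept) / len(kept)


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical latencies in ms: lognormal with a median near 30 ms
    # and a long right tail. These parameters are illustrative only.
    latencies = [rng.lognormvariate(math.log(30.0), 1.29) for _ in range(100_000)]

    print(f"per-request mean:      {event_mean(latencies):8.1f} ms")
    print(f"trimmed mean (99%):    {trimmed_mean(latencies):8.1f} ms")
    print(f"time-weighted mean:    {time_weighted_mean(latencies):8.1f} ms")
    print(f"random-instant sample: {sampled_experience(latencies, 100_000, rng):8.1f} ms")
```

On a tiny hand-checkable input such as durations of 1, 1, 1 and 9, the per-request mean is 3 while the time-weighted mean is 84 / 12 = 7, because three quarters of the timeline is spent inside the single long request. Trimming that request away gives a mean of 1, which moves the metric even further from what an observer at a random instant would see.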

Ultimately, the article serves as a crucial reminder for engineers to shift their perspective from purely event-driven metrics to a time-weighted understanding of system performance, aligning more closely with actual user experience.