Alice is impatient

This post examines a common but often misunderstood discrepancy in system performance: why users perceive a service as much slower, or its outages as much longer, than internal metrics suggest. Author Marc Brooker introduces 'Alice' and 'Alex,' who experience a service's mean request time or mean time to recovery (MTTR) as significantly longer than what the engineering team measures. The core of the issue is the 'inspection paradox,' a statistical effect in which an observer who arrives at a random moment is more likely to land inside a long event than a short one, so their observations are biased toward longer durations.

The fundamental disconnect arises because system metrics typically measure averages per request or per outage event, treating each instance equally.
Conversely, users (like Alice and Alex) experience time continuously, meaning they are disproportionately exposed to, and thus heavily weight, the longer requests or outages.
Technically, while the service reports a per-request mean latency E[T], the customer experiences a time-weighted (length-biased) average, E[T^2] / E[T], where T is the duration. This value is never lower than E[T] and is much higher when the latency distribution has a heavy tail.
A simulation demonstrates this effect: a service with a 30ms median and 600ms 99th percentile latency might report a mean of 254ms, but customers would experience an average of 410ms.
This paradox underscores why tail latency (e.g., p99 or p99.9) and long recovery times are far more critical to customer experience than overall averages, as they dominate user-perceived wait times.
Brooker also argues against 'trimmed measurements' for latency or recovery times, as they deliberately discard the critical tail data that most impacts user perception.
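
The gap between the two averages can be reproduced with a small sketch. A customer who arrives at a random instant lands in a request of duration t with probability proportional to t, which is why the experienced mean works out to E[T^2] / E[T]. The code below is illustrative only: the function names and the latency mix in `hypothetical_latencies` are assumptions made for this example, not the article's simulation, and they are not tuned to reproduce its figures.

```python
import bisect
import itertools
import random


def request_mean(durations):
    """Per-request mean E[T]: every request counts once, as on a service dashboard."""
    return sum(durations) / len(durations)


def time_weighted_mean(durations):
    """Length-biased mean E[T^2] / E[T]: each request is weighted by its own duration."""
    return sum(d * d for d in durations) / sum(durations)


def sampled_experience(durations, samples, rng):
    """Monte Carlo check: lay the requests end to end on a timeline, pick uniformly
    random instants, and average the duration of the request covering each instant."""
    ends = list(itertools.accumulate(durations))  # ends[i] = finish time of request i
    total = ends[-1]
    picked = 0.0
    for _ in range(samples):
        t = rng.uniform(0, total)
        # First request whose end time is after t; the min() guards the t == total edge.
        i = min(bisect.bisect_right(ends, t), len(durations) - 1)
        picked += durations[i]
    return picked / samples


def hypothetical_latencies(n, rng):
    """Made-up workload (an assumption, not the article's data):
    mostly fast requests plus a slow tail of about 2%."""
    return [
        rng.lognormvariate(6.0, 0.5) if rng.random() < 0.02 else rng.lognormvariate(3.4, 0.4)
        for _ in range(n)
    ]


if __name__ == "__main__":
    rng = random.Random(42)
    latencies_ms = hypothetical_latencies(100_000, rng)
    print(f"per-request mean:   {request_mean(latencies_ms):.1f} ms")
    print(f"time-weighted mean: {time_weighted_mean(latencies_ms):.1f} ms")
    print(f"sampled experience: {sampled_experience(latencies_ms, 100_000, rng):.1f} ms")
```

A tiny hand-checkable case shows the same effect: for durations [1, 1, 1, 9], `request_mean` gives 3.0 while `time_weighted_mean` gives 7.0, because the single long request occupies 9 of the 12 time units on the timeline.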

Ultimately, the article serves as a crucial reminder for engineers to shift their perspective from purely event-driven metrics to a time-weighted understanding of system performance, aligning more closely with actual user experience.
