Users Do Not Experience Averages
Many years ago I was leading architecture at a major car rental company. We had one application running on incredibly powerful (and expensive) hardware, and its performance was truly awful. When I asked the team what the CPU was doing during the load test, I was told — confidently — that the CPU was only hitting 60%. The implicit message was: the machine has headroom, so the problem must be somewhere else.
I asked for the second-by-second data instead of the five-minute average. What I saw was not a CPU flat-lining at 60%. What I saw was a CPU slamming from 100% to nothing, to 100%, to nothing, in rapid cycles — over and over. On average, yes, it looked like 60%. But no user was ever served by “an average CPU.” Every user’s request landed on one of those 100% spikes or one of those dead intervals, and the whole platform was either churning like crazy or getting blocked on database I/O.
The average said “everything is fine.” The distribution said “nothing is fine.”
(Let’s leave aside the fact that our wonderful consultant partners had told our CIO that the future was XML and XSLT, so the team had built an interactive application using a data orchestration framework 🤦🏻‍♂️.)
Users do not experience averages. Every user experiences a single point in a distribution — a single page load, a single onboarding attempt, a single transaction — and it is the shape of the distribution around that point, not the mean, that determines whether any given user succeeds or fails. A dashboard that reports a mean without a standard deviation, a percentile, or a trend is lying to your board about the real user experience.
TL;DR: I was trained as a Six Sigma black belt at GE, I read Deming, I’m obsessed with Lean and the Theory of Constraints, and I’ve spent 30 years running engineering and product organisations. The single most common and most expensive analytical mistake I see — in board decks, in engineering dashboards, in OKR reviews, in private-equity diligence — is the confusion between the average of a thing and the experience of the thing. The fix is not more data. The fix is a vocabulary — standard deviations, tails, control charts, wing-to-wing outcomes — that most product and analyst teams were simply never taught. That’s the real content of this article.
What users actually experience
There is no such person as “the average user.” There is only the user who just hit your site at three in the morning, routed to a cold-cache region, behind a flaky mobile network. Her onboarding either worked or it didn’t. Her page either loaded in 1.2 seconds or 11 seconds. Her support ticket either got a two-minute response or a two-day one.
Your dashboard can say “average page load: 1.8s” in perfectly good faith. But no user is served by 1.8 seconds. One user got 0.6 seconds and another got 9. The first is delighted and the second is gone — probably forever. Your CFO, staring at 1.8 seconds on a slide, sees a system that looks healthy. The customer who rage-churned this morning knows otherwise.
This isn’t a pedantic point about statistics. It’s a fundamental claim about how software meets reality. Every interaction is a single sample from a distribution, and every customer lives at one specific point in that distribution — the point that happened to them. The average is an artefact of aggregation. It doesn’t exist in the world.
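To make that concrete, here is a minimal sketch with invented numbers: twenty hypothetical page loads whose mean is a perfectly respectable 1.8 seconds.

```python
import statistics

# Twenty hypothetical page loads, in seconds. Most are fast; a few are dreadful.
page_loads = [0.6, 0.7, 0.7, 0.8, 0.8, 0.9, 0.9, 1.0, 1.0, 1.1,
              1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.8, 3.6, 5.0, 9.0]

mean = statistics.mean(page_loads)                              # what the dashboard reports
median = statistics.median(page_loads)                          # what the typical user got
slow = sum(1 for t in page_loads if t > 3.0) / len(page_loads)  # the users who are gone

print(f"mean {mean:.1f}s | median {median:.1f}s | {slow:.0%} of users waited over 3s")
# mean 1.8s | median 1.1s | 15% of users waited over 3s
```

The dashboard number is true, and it describes nobody.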
The board-deck fallacy
Here is the pattern I see constantly. A team presents the board with “our metric has moved from 59% to 62% — great news.” The whole room nods. The CEO smiles. The board is satisfied.
Nobody asks the two questions that matter. First: what was it three months ago, and six, and nine? Second: how big is the normal month-to-month variation in this metric — what is its standard deviation?
Without the first answer you have a point in time, which tells you nothing. A 62% with a rising six-month trend is a very different story from a 62% that followed a 71% and an 83%. Without the second answer you have a number without a sense of scale. If this metric typically swings by ±5 percentage points month to month as a matter of normal system noise, then 59% to 62% is not progress. It’s the metric doing what it always does. You have not won; you have rolled a pair of dice.
I don’t allow people to present board decks with point-in-time numbers. It’s lazy and unhelpful. The rule is: if it’s a metric, I want to see the trend; and where it makes sense, I want to see the standard deviation on the chart too. If there’s one thing that ought to be taught in infant school, it’s how to read a control chart. A straight line of dots scattered inside a ±1 standard-deviation band is not a trend. It’s the system behaving exactly as the system behaves. Celebrating it is worse than doing nothing — it anaesthetises the organisation to its own lack of progress.
Where this way of thinking comes from
Three formations shape how I look at metrics, and they’re worth naming because the PM world has largely lost touch with all of them.
The first is Six Sigma, which I learned at GE. The popular caricature of Six Sigma is DMAIC rituals and coloured belts. That misses the point. The centre of the GE training was wing-to-wing measurement — a phrase borrowed from the jet engine business. You do not measure the quality of the fan blade, the quality of the combustion chamber, or the quality of the assembly step in isolation. You measure the entire customer outcome, end to end — the engine under the aircraft’s wing, across its service life, in the hands of the airline. Every upstream metric is subordinate to that end-to-end outcome. Six Sigma, properly understood, is an outcomes methodology, not a process-quality methodology, and it is directly compatible with everything this blog argues about outcome-based roadmaps.
The second is W. Edwards Deming. Deming’s most famous and most important claim is that 95% of the variation in worker performance is caused by the system, not by the worker. Blame the system, not the person. It is a thesis about humility: when a team’s numbers look bad, the overwhelming odds are that the machine around the team is producing the bad numbers, not the people inside it. I’d add one refinement: 95% was Deming writing about manufacturing-floor work, where the system really does dominate. For knowledge work — product managers, analysts, senior engineers, where the individual has real agency over her own tools, workflows and queue management — I think it’s closer to 70% system, 30% individual. Still system-dominant, but not as overwhelming as Deming’s original. Either way, when a metric goes sideways, look at the system first.
The third is Theory of Constraints and Lean. I’ve written elsewhere that in 30+ years of CTO, CTPO and NED work I have not yet found a team-throughput problem where ToC + Lean wasn’t the right lens. For the purposes of this article, the relevant Lean idea is that a dashboard showing aggregate numbers without variation is a form of waste — it consumes attention without conveying information, and it produces decisions that are locally optimised for the aggregate and globally wrong for the customer.
These are not methodologies I’m telling you to go and adopt. They are the formation that shaped how I ask questions. The reason I mention them is that the vocabulary they provide — distributions, percentiles, control charts, wing-to-wing, common-cause, special-cause — is the vocabulary that most PM and analyst teams have never been taught, and it is genuinely the vocabulary the conversation needs. Much of what looks like a product-management problem in a given organisation turns out, when you scratch at it, to be a grain-of-the-system problem — the business has been shaped over years in ways that make certain conversations easy and others impossible, and this is one of the conversations most businesses have unintentionally made impossible.
Common-cause vs special-cause variation — the idea nobody was taught
Common-cause variation is the normal, expected variation a stable system produces. Special-cause variation is a signal that something has changed in the system itself. The single most consequential analytical skill a product manager can develop is the ability to tell them apart. Teams that can’t tell them apart waste enormous amounts of effort reacting to normal noise as if it were a crisis, and they miss genuine shifts in the system while dismissing them as noise.
Concrete examples make the distinction real.
Common cause: your conversion rate oscillates between 17% and 21% month to month with no obvious pattern. This is the system doing what systems do. Running a retrospective on “why did we only get 18% last week” is a waste of time. The answer is: the same reasons you get 20% in other weeks. Nothing changed.
Special cause: your conversion rate has sat between 17% and 21% for two years and then drops to 11% for three consecutive weeks. Something has changed. The system has had a shock — a bug, a competitor launch, a pricing change, a regional outage, a new onboarding experiment that failed. This is the moment to mobilise.
The reason this matters isn’t academic. It’s that organisations routinely do the exact opposite of what the variation structure demands. They hold post-mortems on common-cause wiggles because they happen to be down. They dismiss genuine special-cause signals because “we’ve seen this before.” They fire people whose metrics are bad because of system effects the individual couldn’t influence. Deming’s 95% — or my 70% — is the antidote.
The practical tool is a control chart: a time series with the mean drawn in and ±1 and ±2 standard deviation bands shaded around it. A new data point inside the bands is common-cause, and you leave it alone. A new data point outside the bands — or seven consecutive points on one side of the mean, or six consecutive points trending in one direction — is a signal, and you investigate. That’s the whole discipline. It fits on a business card.
The real cost: compound mediocrity
Here is the thing that makes this more than a pedantic complaint. It’s what happens when a team and its board live inside this vocabulary gap for years on end.
The organisation doesn’t blow up. There’s no scandal. There’s no incident report. What happens instead is that every quarter the board gets a deck that reports averages without distributions and point-in-time numbers without trends. Every quarter the leadership team makes decisions off those slides — which product to invest more in, which team to expand, which roadmap item to ship, which pricing move to make. Some of those decisions are genuinely good. Many of them are 1% sub-optimal — not catastrophically wrong, just slightly worse than they could have been if the data had been read properly. Meanwhile the organisation looks productive because the feature factory keeps shipping; the deck says 62% hooray and the roadmap is green. Nobody is lying. Everyone is a little bit wrong.
A quarter of 1% sub-optimal decisions is nothing. Compounded over three years, you have an organisation that has invested in the wrong products, expanded the wrong teams, and shipped the wrong roadmap, and is maybe 30% worse than its peers — and nobody can point to when it happened because there was never a crisis. It is a slow, silent drift into mediocrity, and the metric that would have detected it was hidden by the very dashboards the board was using to govern.
Einstein supposedly called compound interest the greatest force in the universe. My personal motto is Custodi Detractione — keep chipping away — because compound improvement, pursued deliberately over years, is genuinely how great businesses are built. The mirror image is also true. Compound mediocrity is real, and the only thing standing between a team and one outcome or the other is the ability to tell real progress from statistical noise. That ability is a training problem. That’s what this article is about.
Two stories that show the stakes
Here’s a recent example from my diligence work that still makes me wince. I was reviewing the board pack of a business with a very large estate of production systems. One slide reported the percentage of security controls implemented across the estate, broken out by country: “Germany 78%, UK 82%, US 76%.” The narrative ran: “we’ve moved from 24% unimplemented to 22% unimplemented — good progress.”
Every control was weighted equally. A critical-severity control and a trivial one counted the same. Nowhere in the deck was there a breakdown by severity. The two percentage points that had been closed were entirely the easy, low-severity stuff. The high-severity gaps — the ones that would actually cause a breach — were untouched. The customer whose account gets compromised does not care that you’re at 22%. He experiences the one specific control gap that hit him, not the weighted average of all the controls across all the countries. That slide, in that form, made the board complacent about an existential risk. An unweighted aggregate was presented as progress, and the board believed it.
Compare that with 10x Banking, which builds core banking platforms for Tier 1 banks. The performance culture there was calibrated entirely differently. All latency numbers were presented internally at p99, and the team cared about p100 — the single worst transaction in the entire distribution. The reason was simple. At Tier 1 bank scale, even a 1-in-100 transaction failure is thousands of customers a day, and every one of them is somebody’s real money. A mean would have been useless. A p95 would have been a lie. The team lived in the tail, because the customer lives in the tail. That’s the culture every product organisation that handles real customer outcomes should aspire to.
The playbook — three layers
Different readers have different leverage. Here is what to do on Monday morning, in three layers.
1. If you’re an analyst or a product manager preparing a deck
The rule is simple and non-negotiable. Never present a point-in-time number. If you’re reporting a metric, you report it with a trend. Eight data points minimum — enough for the eye to see the shape. Where the metric has a natural distribution — latency, conversion, response time, engagement, resolution time — add a standard deviation to the chart, or switch to a percentile (p95 at minimum; p99 if your customer outcomes live in the tail, as at 10x Banking).
When you present the trend, draw the mean line in, shade a ±1 SD band around it, and let the audience see for themselves whether the new point is inside the band (noise) or outside it (signal). You will look more competent, your boss will look more competent, and the board will make better decisions. This is a matter of effort, not statistical sophistication — every dashboarding tool in current use can do this with two clicks.
2. If you’re a CPO or a senior leader who controls the culture
Two Monday-morning policy changes. First: ban point-in-time numbers from every deck that crosses your desk. Send it back. Not in a humiliating way — in an educational way. “I need the trend. I need to see what it was last quarter, and the quarter before, and the quarter before that. If that’s not in the deck, I can’t make a decision off it.”
Second: demand a burn-down or burn-up on any metric that’s meant to be trending in a direction. The burn-down is the most powerful visual instrument for making compound progress visible — and for exposing the absence of it. Years ago I joined a company whose alerts dashboard had 50 pages of alerts. Fifty pages. It was impossible to tell a real problem from the background radiation. I published a daily burn-down of the alert count for three months, in front of the whole team. We chipped away at the problem every day, visibly, and we got to a good place. The burn-down wasn’t a measurement tool. It was a cultural tool. It made progress real. It made the absence of progress equally real. You can’t have grown-up conversations about improvement without an instrument that lets the whole team watch the improvement happen. The same discipline is what makes a firebreak sprint work — the whole squad staring at a burn-down of operational debt for two weeks until it’s gone.
3. If you’re a NED or a board member
You’re staring at a deck you didn’t write, under time pressure, in a room where the CEO is watching. You need three questions that pop the balloon without humiliating the person who built the slide.
- “What’s the trend?” — if the slide shows a point in time, this is the polite version of “this deck is not acceptable.”
- “What’s the normal variation in this metric?” — if the presenter can’t answer, you’ve learned what you needed to learn.
- “Has anything actually changed, or is this inside the band?” — the direct ask for common-cause vs special-cause judgement.
Ask them in that order. Don’t press. If the person presenting doesn’t have the answer, leave it. The cultural fix isn’t humiliating the analyst who was following the deck template that the executive team set. The cultural fix is the quiet conversation afterwards, with the CEO’s permission, in which you explain to the analyst — or to the whole analyst team — what you actually want to see in a board pack and why. I’ve done this more times than I can count, and it’s the single highest-leverage intervention available to a board member. You change the template, the template changes the deck, the deck changes the decisions, the decisions change the business. Leave the public theatre alone.
Distribution-aware Key Results
One practical implication for anyone working in an OKR framework — and for anyone trying to sort out the OKR vs KPI question in their own organisation. A Key Result expressed as a mean is almost always wrong. “Reduce average page load to under 2 seconds” is a KR that can be hit with a distribution that still has 15% of requests taking ten seconds — and those 15% are the users who are leaving. A better KR is expressed in distribution-aware terms: “p95 page load < 2 seconds AND standard deviation of page load < 400ms.” Now the KR forces the team to address both where the typical user lives and how wide the tail is.
The same thinking applies across the board. “Average support response time” should become “p95 response time” or, for a high-stakes service, “p99.” “Average NPS” is almost meaningless; “percentage of NPS respondents scoring 9 or 10” is a distribution-aware version that tells you something. “Average onboarding completion time” should become “p80 onboarding completion time” so that you’re driving the long tail of struggling users, not polishing the easy cases. This is equally true at the North Star metric level, across the pirate metrics funnel, and in every SaaS metric your board looks at — a distribution with a named tail beats an aggregate every time.
This isn’t statistical ceremony. It’s matching the KR to the customer reality. A customer has a single experience. Your KR should be expressed in a form that reflects that. For more on this kind of Key Result design, on choosing the right type of metric for a given Objective, on measuring discovery success, and on concrete OKR examples that show distribution-aware thinking in practice, the related articles on this blog go much deeper.
Frequently asked questions
Why are averages misleading?
Averages are misleading because no user ever experiences an average. Each user experiences a single point on a distribution — a single page load, a single onboarding attempt, a single transaction. The average hides the variation around that point, and the variation is what determines whether any given user succeeds or fails. An average also quietly assumes a roughly normal distribution, and most real business metrics (income, usage, response time) are not normally distributed: they are skewed, with long tails that drag the mean away from the typical experience. Reporting an average without a standard deviation or a percentile is a form of information loss that systematically misrepresents the real user experience.
What is the difference between common cause and special cause variation?
Common-cause variation is the normal, expected variation that a stable system produces — the background noise a process generates even when nothing has changed. Special-cause variation is a signal that something has actually changed in the system, such as a bug, a competitor launch, a process change, or an outage. The practical importance of the distinction is that common-cause variation should be left alone (investigating it wastes effort) while special-cause variation should be acted on. The tool for telling them apart is a control chart: a time series with the mean and ±1 / ±2 standard deviation bands drawn in.
What is the flaw of averages?
The “flaw of averages” is the general name for a family of mistakes that arise from treating an aggregate statistic as a description of an individual experience. The flaw is not arithmetic — the average is correctly computed. The flaw is semantic: the average describes a population, but the decision being made usually concerns individuals (a specific customer, a specific request, a specific quarter). Famous illustrations include the statistician who drowned crossing a river whose average depth was three feet, and the 1950s US Air Force cockpit designed around the “average” pilot, which turned out to fit no actual pilot at all.
Why is tail latency (p95 or p99) more important than average latency?
Tail latency matters because users experience individual requests, not averages. In most real systems, a small percentage of requests take much longer than the median, and that small percentage determines whether customers churn, whether transactions fail, and whether downstream services time out. At scale, 1% of requests can mean thousands of users a day with a genuinely broken experience. An average latency number will smooth over that tail entirely and make the system look healthy when a measurable slice of customers are leaving. p95 or p99 forces you to look directly at the slowest experiences your users actually had.
How should I present data to a board?
Three rules. First, never present a point-in-time number — always show a trend, with at least eight data points so the shape is visible. Second, where the metric has natural variation, draw the mean and a ±1 standard deviation band on the chart so the audience can see whether the latest point is inside the band (noise) or outside it (signal). Third, if the metric is meant to be moving in a direction, present it as a burn-down or burn-up so compound progress is visually unambiguous. A board deck without trends, distributions or compound progress is a deck that cannot support a grown-up conversation about where the business actually is.
What does Deming’s 95% rule mean?
W. Edwards Deming claimed that 95% of the variation in worker performance is caused by the system the worker operates inside, not by the worker herself. In other words: when a team’s numbers look bad, the overwhelming odds are that the organisation, tools, workflows and incentives around the team are producing the bad numbers, not the people. For manufacturing-floor work, Deming’s 95% is widely accepted. For knowledge work — product managers, analysts, senior engineers — where the individual has more agency over her own tools and queue management, I’d place the figure closer to 70% system, 30% individual. Either way, when a metric goes wrong, investigate the system first.
Conclusion
Every user experiences a single point in a distribution — a single page load, a single onboarding attempt, a single transaction. The average is an artefact of aggregation; it is not a thing that happens to anyone. A team that understands this — that uses standard deviations, control charts, percentile thinking, and the vocabulary of common-cause and special-cause variation — can tell real progress from statistical noise, hold grown-up conversations about where the business actually is, and compound small improvements into genuine advantage over years. A team that doesn’t can look busy and productive for a very long time while quietly drifting into compound mediocrity.
Einstein supposedly called compound interest the greatest force in the universe. He was right, and the force runs in both directions. Custodi Detractione — keep chipping away — is a prescription, not a description. You only get to chip away at something if you can see whether you’re actually chipping. That’s what the distribution, the standard deviation, the burn-down and the control chart are for. They’re not statistical ceremony. They’re the instruments of honest progress. And if there’s one thing that ought to be taught in infant school, it’s how to read one.