by jldugger 835 days ago

Pretty much every dataset I work with as an SRE is full of these paradoxes. One classic published example comes from Google:

A network engineer took a trip to Indonesia or something (can't find the citation to confirm the exact tale), noticed the service was slow, and when asking around everyone said "that's how it's always been." Basically the local cellular networks are slow and the off-island fiber connections are saturated. Back at the office they decided to attack the problem by optimizing payload sizes. They did the work, cut download sizes in half, and shipped it. Latency metrics? Average and p95 latency actually increased after shipping the work to production.

How does an objectively good change make things worse? Well, the service had improved for those customers so much that they used it a lot more. Even with the lighter demand on bandwidth, their network latency to the datacenter was still worse than for typical US customers, so as more of these people realized the service sucked way less, they used it more and drove the numbers up.

I have tons of these examples where a data team looks at a particular slice of request telemetry and comes to a wrong conclusion because they didn't model enough of the system, or controlled for the wrong (or too many) variables. The worst ones are the cyclic finger-pointing situations that Simpson's paradox can produce: app developers blaming a regression on the server-side component while the server team blames the app team, often because the server and app release schedules accidentally aligned too well. In this case we have canary data to exonerate our side of the equation, but sometimes the problem lies in even deeper spaces, like app updates from an entirely different app.


Tomte 835 days ago

But your example isn't a case of Simpson's Paradox (which is purely statistical), but Jevons Paradox (which is about human behaviour and economics).

jldugger 835 days ago

Good point! I'm just a humble Linux sysadmin dubbed "SRE" who slept through Stats for Engineers and now pays the price every week dealing with SWEs eager to blame me for their mistakes.

roenxi 834 days ago

You were right; that was a case of Simpson's paradox. Every category saw its latency improve, but the overall statistic worsened. Jevons paradox is what caused the induced demand, but when the new usage data was gathered the initial review was an example of Simpson's paradox.

Effect of the change -> Jevons paradox.

Measurement of Jevons paradox -> Simpson's paradox (in this case; that isn't a general rule).

The fact that the two are easily linked is one of the reasons the statistical paradox is so common in practice.

lern_too_spel 834 days ago

Latency improved for everyone, but overall average latency increased because usage increased faster in high latency areas. That's Simpson's Paradox. Simpson's Paradox doesn't care where the subpopulations you're measuring came from.
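A sketch of the arithmetic behind that, in notation that is mine and not from the thread: let $w_g$ be the share of requests coming from group $g$ and $\mu_g$ that group's mean latency, with primes marking the values after the change. The change in the overall mean splits into a per-group term and a mix-shift term:

```latex
\begin{aligned}
\Delta\bar{\mu}
  &= \sum_g w'_g \mu'_g - \sum_g w_g \mu_g \\
  &= \underbrace{\sum_g w'_g \,(\mu'_g - \mu_g)}_{\text{per-group change}}
   + \underbrace{\sum_g (w'_g - w_g)\, \mu_g}_{\text{mix shift}}
\end{aligned}
```

The first term is negative when every group got faster. The second term is positive when request share moves toward groups whose latency was high to begin with, and nothing stops it from being the larger of the two.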

ploxiln 835 days ago

If I recall the YouTube slow-internet optimisation case correctly, I think it is an example of Simpson's paradox. They made it faster for countries with fast internet, and faster for countries with slow internet, and then the average performance across all users/countries was slower, because now the countries with slow internet used YouTube much more than before.

lordgrenville 834 days ago

But the improvement induced the demand, which to my mind makes this different from Simpson's Paradox.

bryanrasmussen 834 days ago

I would say the improvement allowed the demand to be met, everybody wanted to use youtube, but few could.

Just like many people may want to eat a wide range of expensive tasty food, but have to make do with junk because it's what they can afford.

mFixman 834 days ago

It would be Simpson's Paradox if Google services in Indonesia were initially slow because Indonesians tend to use YouTube more often than lighter services.

There wasn't an error in the conclusions of the initial measurement. It was the solution that had problems.

lern_too_spel 834 days ago

Doesn't matter. That is not relevant to the paradox.

kgwgk 834 days ago

How does "Average and p95 latency actually increased after shipping the work to production. How does an objectively good change make things worse?" relate to Simpson's paradox again?

ploxiln 834 days ago

That's exactly it. After "shipping the work to production" (making it faster for everybody), the overall average and p95 got worse. Each sub-population experienced improvement: countries with fast internet got faster youtube, countries with slow internet got faster youtube. But the overall average and p95 got worse: overall average was slower youtube. Because now more users from the second sub-population bring the overall average speed down (or latency up). That's Simpson's paradox.
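A minimal sketch of that effect with made-up numbers (the request counts and latencies below are purely illustrative, not from the YouTube case): both groups get faster, the slow group sends five times as many requests afterwards, and the overall mean and p95 both get worse.

```python
def mean(xs):
    return sum(xs) / len(xs)

def p95(xs):
    """Nearest-rank 95th percentile."""
    ordered = sorted(xs)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]

def overall(groups):
    """Flatten per-group latency samples into one list."""
    return [x for xs in groups.values() for x in xs]

# Hypothetical per-request latencies in ms, one entry per request.
before = {"fast": [100] * 960, "slow": [1000] * 40}
after = {"fast": [80] * 960, "slow": [600] * 200}  # slow-network usage grew 5x

# Every sub-population improved...
for name in before:
    assert mean(after[name]) < mean(before[name])

# ...but the aggregate metrics got worse.
print("before:", mean(overall(before)), p95(overall(before)))
print("after: ", round(mean(overall(after)), 1), p95(overall(after)))
```

Slicing the same metrics by region (or whatever the sub-populations are) before comparing releases is what makes the improvement visible again.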

clemiclemen 834 days ago

This reminds me of a similar story with YouTube [1] where reducing the page weight made the latency metrics look worse because more people with lower-end connections could access the page.

Metrics interpretation is as important as the metrics themselves!

[1]: https://blog.chriszacharias.com/page-weight-matters

jldugger 834 days ago

That may be exactly the story I was thinking of, or perhaps the original of a story I encountered on a GCP cloud post or something.

tetris11 834 days ago

isn't that the "One More Lane, I Promise!" meme

wastewastewaste 834 days ago

It is, but usually the meme misrepresents induced demand. While I don't like cars and we should focus on other infrastructure, adding a lane does help.

It does not reduce congestion, but it does now serve more people at the same congestion level. And those people have come from somewhere. Sometimes from public transport, which isn't really good, but sometimes from some backwater road.

vidarh 834 days ago

The bigger problem with induced demand is that it's often poor ROI to add that lane where the demand is highest.

That is, imagine you have a big city. You can add capacity for 1m extra people to travel to the city centre, where there's lots of congestion. Or you find ways to induce demand around the other limits of town, even though current demand is low there.

Odds are you'll pick the first, because it's "obvious" and doesn't require much thinking to see it'd help. But we really ought to look at cost-benefit of the second option too, because repeatedly inducing demand in the centre keeps driving up the incremental cost of further improvements, along with plenty of other undesirable second-order effects.

otherme123 834 days ago

Adding lanes is like getting a bigger cache with the same throughput.

It's obvious at the supermarket: what goes faster, a single cashier processing four short lanes of 10 people with round robin, or two cashiers processing a single lane with 40 people?
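A rough sketch of that comparison, assuming every customer takes the same fixed time at the till (the function name and the one-unit service time are my own illustration): how the 40 people are split into lanes never enters the calculation, only the number of cashiers does.

```python
import heapq

def time_to_clear(customers: int, cashiers: int, service_time: float = 1.0) -> float:
    """Time until the last customer is done, with identical cashiers.

    Each customer goes to whichever cashier frees up first. Lanes only
    decide where people wait, so they are not modelled at all.
    """
    free_at = [0.0] * cashiers  # time at which each cashier is next free
    heapq.heapify(free_at)
    for _ in range(customers):
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + service_time)
    return max(free_at)

# 40 shoppers either way; four lanes vs one lane makes no difference here.
print(time_to_clear(40, cashiers=1))  # one cashier, however many lanes
print(time_to_clear(40, cashiers=2))  # two cashiers, single lane
```

Under these assumptions the single cashier needs 40 time units and the pair needs 20, which is the cache-versus-throughput point: more lanes hold more waiting people without getting anyone out the door sooner.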

Is the city center able to process 1m extra people? If not, it doesn't matter how many lanes you build.

vidarh 834 days ago

Well you often can make it able to "process" 1m extra people: You can build overpasses, and tunnels, and taller buildings. But the cost-per-extra-person will tend to go up accordingly, to the point where you could spend an extraordinary amount attracting people out of the centre.

E.g. London's "Crossrail" / Elizabeth line cost $24 billion. Granted, it also allows some people to go through London faster, but I can't help but wonder what that money could've done if applied to attract businesses out of the centre instead. E.g. upgrading links between towns on the outskirts, upgrading town centres, and generally trying to make it more attractive for businesses to be located further out.

Given the extraordinary costs it takes to do large infrastructure projects in London, I'd be very surprised if you couldn't get a higher return on investment that way, or by investing similar sums elsewhere in the UK entirely.

lmz 834 days ago

Until more people choose to live further away because the commute is now tolerable with the extra lane (and it's cheaper), and then you're back to square one.