Hacker News
by developer2 3325 days ago
This particular issue was posted on Quora, where anyone could pick it up and participate in what is essentially a denial-of-service attack (whether or not it was intentional). It wasn't submitted as a private bug report so that Google could fix it; it was spread in a public forum. I think it's fair for Google to politely ask: "Running a few of your own tests to validate an issue you will submit as a bug report is fine, but please don't disclose it to the public until we patch it."

When you operate at the scale of Google, everything is expected to be airtight; outliers should not be possible. It wouldn't surprise me if their monitoring systems are built without the ability to "massage" (i.e., manipulate) statistics, since that is a terrible practice. I don't think a statistician who relies on ignoring outliers would last long working for Google. They're not doing their job if the only thing they care about is silencing warnings to make pretty graphs that falsely show everything is running smoothly. Their job is to work with the truth, not to manufacture little white lies to appease management.

1 comment

Boss: Median latency is 100ms and 99.9th percentile latency is 1 second.

Nobody ever asks about that 0.1%...

When that 0.1%, or even 0.001%, consists of requests taking 5-60 seconds, you have a bomb waiting to go off. There really is a massive difference when you are operating at the scale of Google. If the median is 100ms, the maximum acceptable time (the 100th percentile) is likely below 200ms. A three-nines percentile that is 10x the median isn't a good thing at large scale. Consistency matters more than summary statistics. A small-scale service deployed on my-little-unused-tool.com that receives a few requests per minute is an entirely different ballgame.
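
To put rough numbers on that, here is a minimal Python sketch. Everything in it is made up for illustration: the latency sample, the 1,000,000 requests/second figure (a hypothetical stand-in for "Google scale", not a real traffic number), and the helper names `percentile` and `slow_requests_per_day`. It uses a simple nearest-rank percentile, which is only one of several common definitions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def slow_requests_per_day(requests_per_second, slow_fraction):
    """Expected number of slow requests per day, given a traffic rate and the slow fraction."""
    seconds_per_day = 86_400
    return requests_per_second * seconds_per_day * slow_fraction


# Hypothetical sample (milliseconds): mostly 100ms, a thin 1s tail, one 30s outlier.
latencies_ms = [100] * 99_800 + [1_000] * 199 + [30_000]

p50 = percentile(latencies_ms, 50)
p999 = percentile(latencies_ms, 99.9)
worst = max(latencies_ms)

print(f"median={p50}ms  p99.9={p999}ms  max={worst}ms")

# 0.001% of requests being slow, at two very different (assumed) traffic levels.
slow_fraction = 0.00001
big = slow_requests_per_day(1_000_000, slow_fraction)  # assumed large-scale rate
small = slow_requests_per_day(3 / 60, slow_fraction)   # 3 requests per minute
print(f"large service: {big:,.0f} slow requests/day")
print(f"small service: {small:.4f} slow requests/day")
```

The dashboard numbers here match the boss's summary (median 100ms, 99.9th percentile 1 second), yet the 30-second request never shows up in either figure. Under the assumed rate, a 0.001% tail works out to roughly 864,000 slow requests per day, while the same fraction at 3 requests per minute is about one slow request every few weeks.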