Hacker News
by solidasparagus 2181 days ago
> Assuming a Poisson rate

That feels like a mighty big assumption. Probably big enough that trying to calculate the probability is more misleading than enlightening.

As I mentioned, my comment is meant as an exercise. If we were to take the numbers more seriously, due diligence would be necessary. That said, if we assume that one incident does not affect another, then the Poisson nature falls out as a natural consequence of that independence and the assumption of a constant rate (our null hypothesis).
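To make the "falls out as a natural consequence" step a bit more concrete, here is a minimal sketch of the law of rare events: chop the observation window into n small slots, let each slot independently contain an incident with probability lam/n, and compare the resulting binomial count distribution with Poisson(lam). The value of `LAM` and the function names are made up for illustration, not taken from any GitHub data.

```python
import math

def binomial_pmf(k, n, p):
    """P(exactly k incidents) when n independent slots each fail with probability p."""
    if k > n:
        return 0.0
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(exactly k incidents) under a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def max_pmf_gap(lam, n, k_max=30):
    """Largest pointwise difference between Binomial(n, lam/n) and Poisson(lam)."""
    p = lam / n
    return max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam))
               for k in range(k_max + 1))

LAM = 3.0  # hypothetical expected number of incidents in the window

# Finer slots (larger n) with the same expected count: the gap shrinks.
gaps = {n: max_pmf_gap(LAM, n) for n in (10, 100, 1000, 10000)}
for n, gap in gaps.items():
    print(f"n={n:>6}  max |binomial - poisson| = {gap:.2e}")
```

Independence and a constant per-slot probability are the only ingredients here, which is why the two assumptions together are enough for this kind of rough calculation.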

As long as the incidents are spaced far enough apart that the chance of one affecting another is low, Poisson can be surprisingly realistic. Quite remarkable, given how simple it is. All in all, not that bad an assumption for a back-of-the-envelope calculation in a meeting.

In practice, however, given more time, I would look at the statistics of inter-incident times more carefully. If those look sufficiently different from an exponential distribution, a non-Poisson renewal process might be more appropriate than a Poisson process.
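A rough sketch of what that check could look like, using only the standard library. It computes two quick diagnostics on the gaps between incidents: the coefficient of variation (about 1 for exponential gaps) and the Kolmogorov-Smirnov distance to an exponential with the fitted mean. The data below is simulated and purely hypothetical. Because the mean is estimated from the same sample, standard KS critical values do not apply directly, so treat the distance as a descriptive number rather than a formal test.

```python
import math
import random
import statistics

def coefficient_of_variation(gaps):
    """Sample std / mean of inter-incident times; close to 1 for exponential gaps."""
    return statistics.stdev(gaps) / statistics.mean(gaps)

def ks_distance_to_exponential(gaps):
    """Largest gap between the empirical CDF and an exponential CDF with the fitted mean."""
    mean = statistics.mean(gaps)
    xs = sorted(gaps)
    n = len(xs)
    distance = 0.0
    for i, x in enumerate(xs):
        cdf = 1.0 - math.exp(-x / mean)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        distance = max(distance, abs(cdf - i / n), abs((i + 1) / n - cdf))
    return distance

rng = random.Random(0)
# Hypothetical data: Poisson-like incidents averaging one every 30 days ...
exp_gaps = [rng.expovariate(1 / 30) for _ in range(2000)]
# ... versus suspiciously regular incidents every 25 to 35 days.
regular_gaps = [rng.uniform(25, 35) for _ in range(2000)]

for name, gaps_days in (("exponential", exp_gaps), ("regular", regular_gaps)):
    print(name,
          round(coefficient_of_variation(gaps_days), 3),
          round(ks_distance_to_exponential(gaps_days), 3))
```

A coefficient of variation far from 1, or a large KS distance, would be the cue to consider a more general renewal process.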

But why even start with an assumption that is so likely to be wrong? We know that incidents are frequently correlated. We know that scale and complexity add fragility. We know that GitHub has gotten bigger and more complex. The chance of the probability distribution holding constant over 5 years of major growth is basically zero.

We are talking about different things. One is about attributing causes to an increase in failure rate; the other is about verifying whether there is any material increase in the rate at all. My comment addresses the latter as a back-of-the-envelope calculation.

Strictly speaking, when examined with a fine-toothed comb, yes, the assumptions are very likely wrong. All models are wrong [0], but some of them are useful.

The question is whether we can draw useful conclusions from such a simple model. In my experience, I have been surprised by how often low failure rates are captured well by Poisson processes. Yes, the assumptions could be wrong, but are they very likely to lead to wrong conclusions? Empirical experience and the math say otherwise.

There are sound reasons for why this happens. If you are interested, you can pick that up from Feller. These [1] [2] links might also help.
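One commonly cited reason (my choice of illustration, not necessarily what the linked sections cover) is superposition: merging many independent, individually rare event streams gives something close to a Poisson process even when no single stream is Poisson. A minimal simulation sketch, with made-up parameters, where every source is perfectly regular with its own period and random phase:

```python
import random
import statistics

def merged_gap_cv(num_sources, horizon, rng):
    """Coefficient of variation of the gaps after merging regular, randomly phased streams."""
    events = []
    for _ in range(num_sources):
        # Each source fires rarely: its period grows with the number of sources.
        period = rng.uniform(0.5, 1.5) * num_sources
        t = rng.uniform(0, period)  # random phase
        while t < horizon:
            events.append(t)
            t += period
    events.sort()
    gaps = [b - a for a, b in zip(events, events[1:])]
    return statistics.stdev(gaps) / statistics.mean(gaps)

rng = random.Random(1)
single = merged_gap_cv(1, 50, rng)     # one clockwork source: gaps are all equal
many = merged_gap_cv(500, 2000, rng)   # 500 clockwork sources merged together

print(f"CV with 1 source:    {single:.3f}")
print(f"CV with 500 sources: {many:.3f}")
```

A single regular source has gaps with essentially zero variation, while the merged stream's gaps have a coefficient of variation near 1, which is what exponential inter-incident times would give.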

Given the data that we have, it's a perfectly good first cut, but that's what it is: a first cut. With more data, one can do a more refined analysis.

[0] https://en.wikipedia.org/wiki/All_models_are_wrong

[1] https://en.wikipedia.org/wiki/Poisson_point_process#Approxim...

[2] https://en.wikipedia.org/wiki/Poisson_point_process#Converge...

I, for one, enjoyed your little foray into the field of statistics. So much so, that I'd love to learn some of this stuff as well! I'm a bioinformatics student, so I have a rigorous math background (analysis, linear algebra), but for some reason, our course is quite light on statistics and probability.

What resource would you recommend to get an intuitive grasp of statistics?

To give you an idea about what kind of resource (book) I'm looking for: I'm currently reading Elements of Statistical Learning and I enjoy that it has all the mathematical rigour I need to really understand why all of it works, but also that it's heavy on commentary and pictures, which helps me to understand the math quicker. Counterexamples: Baby Rudin on one side of the spectrum, The Hundred-Page Machine Learning Book on the other.

Hi Eugelo, I am glad you liked it. Feller is neither a stats book nor a machine learning book, but you might like it. It is full of cute and relatable exercises.

Books like ESL are front-end books: they cover the shiny parts and the methods. Feller is more of the back end.

Thanks for the recommendation! When you say "Feller", do you mean this book? [1]

I'm already looking forward to it.

[1]: https://www.amazon.com/Introduction-Probability-Theory-Appli...

I see where you're coming from and I agree with you overall, in particular how this is a first cut approximation.

But I still want to nitpick the details a bit. If you want to determine whether there was a change in the failure rate, you need to use rate statistics - failures per service-hour. Your analysis is only using the numerator while we know the denominator (the number of services in GitHub that can go out) has increased over time - GitHub Actions and Packages are relatively new.

Totally agree with your second paragraph. As I said, it was more of a textbook-ish exercise. If the volume of traffic in the two periods were available, it would be possible to do the kind of analysis you indicate.
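For what it's worth, here is a sketch of how exposure could enter the calculation if it were known. Under the null hypothesis of an equal rate per unit of exposure (service-hours, traffic volume, whatever the denominator is), and conditional on the total number of incidents, the count in the newer period is binomial with success probability equal to that period's share of the exposure. All the counts and exposures below are invented for illustration.

```python
import math

def rate_increase_p_value(k_old, exposure_old, k_new, exposure_new):
    """One-sided p-value for H0: same incident rate per unit exposure in both periods.

    Conditional on n = k_old + k_new, k_new ~ Binomial(n, share of exposure in the
    new period) under H0. The p-value is P(count in new period >= k_new).
    """
    n = k_old + k_new
    p = exposure_new / (exposure_old + exposure_new)
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_new, n + 1))

# Hypothetical counts: 10 incidents in the old period, 20 in the new one.
# Numerator only: pretend both periods had the same exposure.
p_equal_exposure = rate_increase_p_value(10, 1.0, 20, 1.0)

# Same counts, but suppose the new period had twice the service-hours.
p_double_exposure = rate_increase_p_value(10, 1.0, 20, 2.0)

print(f"equal exposure:  p = {p_equal_exposure:.3f}")
print(f"double exposure: p = {p_double_exposure:.3f}")
```

The same raw counts can look like a notable increase or like nothing at all depending on the denominator, which is exactly the nitpick above.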