by srean 2181 days ago
We are talking about different things. One is about attributing causes to an increase in failure rate; the other is about verifying whether there is any material increase in the rate at all. My comment addresses the latter, as a back-of-the-envelope calculation.

Strictly speaking, when examined with a fine-toothed comb, yes, the assumptions are very likely wrong. All models are wrong [0], but some of them are useful.

The question is whether we can draw useful conclusions from such a simple model. In my experience, I have been surprised by how often low failure rates are captured well by Poisson processes. Yes, the assumptions could be wrong, but are they very likely to lead to wrong conclusions? Empirical experience and the math say otherwise.

There are sound reasons why this happens. If you are interested, you can pick them up from Feller. The links [1] and [2] might also help.
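
One of those reasons is the law of rare events: the number of failures among many independent components, each with a small failure probability, is close to Poisson. Below is a minimal Python sketch of a numerical check. It assumes identical, independent components (a simplification), the values of n and p are made up rather than taken from any real outage data, and the function names are my own.

```python
import math

def log_binomial_pmf(k, n, p):
    # log of C(n, k) * p^k * (1 - p)^(n - k), computed in log space to avoid overflow
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_comb + k * math.log(p) + (n - k) * math.log1p(-p)

def log_poisson_pmf(k, lam):
    # log of exp(-lam) * lam^k / k!
    return -lam + k * math.log(lam) - math.lgamma(k + 1)

def total_variation(n, p):
    """Total variation distance between Binomial(n, p) and Poisson(n * p)."""
    lam = n * p
    diff = 0.0
    poisson_mass = 0.0
    for k in range(n + 1):
        b = math.exp(log_binomial_pmf(k, n, p))
        q = math.exp(log_poisson_pmf(k, lam))
        diff += abs(b - q)
        poisson_mass += q
    # The binomial puts no mass above n, so the Poisson tail beyond n counts in full.
    diff += max(0.0, 1.0 - poisson_mass)
    return 0.5 * diff

# Hold the expected number of failures at 2 while the components get
# more numerous and individually more reliable.
for n in (10, 100, 1000):
    p = 2.0 / n
    print(n, p, total_variation(n, p), "Le Cam bound:", n * p * p)
```

The distance shrinks as n grows with n * p held fixed, and it stays below Le Cam's bound of n * p^2. That is the sense in which a Poisson model is a reasonable default for low failure rates, even when the underlying mechanism is not literally a Poisson process.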

Given the data that we have, it's a plenty good first cut, but that's what it is: a first cut. With more data, one can do a more refined analysis.

[0] https://en.wikipedia.org/wiki/All_models_are_wrong

[1] https://en.wikipedia.org/wiki/Poisson_point_process#Approxim...

[2] https://en.wikipedia.org/wiki/Poisson_point_process#Converge...

I, for one, enjoyed your little foray into the field of statistics. So much so that I'd love to learn some of this stuff as well! I'm a bioinformatics student, so I have a rigorous math background (analysis, linear algebra), but for some reason our course is quite light on statistics and probability.

What resource would you recommend to get an intuitive grasp of statistics?

To give you an idea of what kind of resource (book) I'm looking for: I'm currently reading Elements of Statistical Learning, and I enjoy that it has all the mathematical rigour I need to really understand why all of it works, but also that it's heavy on commentary and pictures, which helps me understand the math more quickly. Counterexamples: Baby Rudin on one side of the spectrum, The Hundred-Page Machine Learning Book on the other.

Hi Eugelo, I am glad you liked it. Feller is neither a stats book nor a machine learning book, but you might like it. It is full of cute and relatable exercises.

Books like ESL are front-end books: they cover the shiny stuff and the methods. Feller is more of the back end.

Thanks for the recommendation! When you say "Feller", do you mean this book? [1]

I'm already looking forward to it.

[1]: https://www.amazon.com/Introduction-Probability-Theory-Appli...

That indeed is the book. Just to set expectations, it's an old-style book. I don't recall any diagrams. You may find an online copy in the usual places to check it out before you buy.

No worries, I'm carefully weighing each purchase, especially when it's one third of my monthly budget (greetings from Europe!).

It seems to me that it's considered a classic, although I've never heard of it (probably due to my ignorance). Do you have any more such nice recommendations up your sleeve? Don't limit yourself to probability, I'm looking for some reading for the summer :-D

You can try "All of Statistics"; it's a concise but useful and modern take on statistics. For a different approach, I quite like Allen Downey's ".. for the Hacker" series. For stochastic processes, Parzen's "Stochastic Processes" is a nice and approachable read. If you want to go down the rabbit hole, I would recommend graycat's comment stream here on HN. From time to time he posts about books to read.

You said you are familiar with linear algebra. The logical next stop could be Hilbert spaces. The theory looks at functions as vectors and analyzes their properties using linear algebraic tools that work even in infinite-dimensional spaces. It sees quite heavy use in traditional machine learning. Before diving into Hilbert spaces proper, you could revisit linear algebra in Halmos' "Finite-Dimensional Vector Spaces". There he pretends to teach you linear algebra but actually teaches you about Hilbert spaces. In other words, he teaches you linear algebra without relying on the restriction of finite dimensionality.
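
To give a taste of "functions as vectors", here is a small Python sketch. It treats polynomials on [-1, 1] as vectors, uses the integral of f(x) g(x) as the inner product, and runs plain Gram-Schmidt on 1, x, x^2. This is only a finite-dimensional shadow of the real thing (a genuine Hilbert space such as L^2 also needs completeness), and the coefficient-list representation and function names are my own choices for illustration.

```python
from fractions import Fraction

def inner(f, g):
    """<f, g> = integral of f(x) * g(x) over [-1, 1].

    Polynomials are coefficient lists [c0, c1, c2, ...] meaning c0 + c1*x + c2*x^2 + ...
    """
    total = Fraction(0)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            n = i + j
            if n % 2 == 0:  # odd powers integrate to zero on [-1, 1]
                total += Fraction(a) * Fraction(b) * Fraction(2, n + 1)
    return total

def subtract_scaled(f, g, c):
    """Return f - c * g as a coefficient list."""
    size = max(len(f), len(g))
    f = list(f) + [0] * (size - len(f))
    g = list(g) + [0] * (size - len(g))
    return [Fraction(a) - c * Fraction(b) for a, b in zip(f, g)]

def gram_schmidt(basis):
    """Classical Gram-Schmidt, exactly as for ordinary vectors, without normalizing."""
    orthogonal = []
    for v in basis:
        w = [Fraction(c) for c in v]
        for u in orthogonal:
            w = subtract_scaled(w, u, inner(v, u) / inner(u, u))
        orthogonal.append(w)
    return orthogonal

monomials = [[1], [0, 1], [0, 0, 1]]  # 1, x, x^2
ortho = gram_schmidt(monomials)
for poly in ortho:
    print([str(c) for c in poly])
```

The third polynomial comes out as x^2 - 1/3, which is a multiple of the Legendre polynomial P2(x) = (3x^2 - 1)/2. The same projection-and-subtraction recipe from linear algebra works unchanged once the dot product is replaced by an integral.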

And you are right, books are so damn expensive. India is somewhat better in the sense that we have 'low price editions': same content but printed on lower-quality paper. They are not the prettiest things, but they are very student-friendly. Note that these are legit printings, not pirated copies.

I see where you're coming from and I agree with you overall, in particular how this is a first cut approximation.

But I still want to nitpick the details a bit. If you want to determine whether there was a change in the failure rate, you need to use rate statistics - failures per service-hour. Your analysis is only using the numerator while we know the denominator (the number of services in GitHub that can go out) has increased over time - GitHub Actions and Packages are relatively new.
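
To make the numerator-versus-denominator point concrete, here is a rough Python sketch of a rate comparison that accounts for exposure. It assumes failures in each period are Poisson with a rate proportional to exposure (service-hours); under the hypothesis of equal rates, the failures in the later period, given the total, follow a binomial distribution with success probability equal to that period's share of the exposure. The counts and exposures are invented for illustration and are not GitHub data, and the function names are hypothetical.

```python
import math

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def rate_increase_p_value(failures_before, exposure_before,
                          failures_after, exposure_after):
    """One-sided p-value for 'the failure rate per unit of exposure went up'.

    Conditional test: given the total number of failures, the later count is
    Binomial(total, exposure_after / total_exposure) if the rates are equal.
    """
    total = failures_before + failures_after
    share_after = exposure_after / (exposure_before + exposure_after)
    return binomial_upper_tail(failures_after, total, share_after)

# Invented numbers: 10 failures in the earlier period, 20 in the later one.
same_exposure = rate_increase_p_value(10, 1000.0, 20, 1000.0)
doubled_exposure = rate_increase_p_value(10, 1000.0, 20, 2000.0)
print("equal service-hours:  ", same_exposure)
print("doubled service-hours:", doubled_exposure)
```

With equal exposure, doubling the failure count looks like weak evidence of a higher rate. If the service-hours also doubled, the same counts are entirely consistent with an unchanged rate.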

Totally agree with your second paragraph. As I said, it was more of a textbook-style exercise. If the volume of traffic in the two periods were available, it would have been possible to do the kind of analysis you describe.