by belter 1548 days ago
Excluding ones reported as [Errors], [Scheduled] or [Notifications]

2019 -> 39 Incidents

2020 -> 67 Incidents

2021 -> 86 Incidents

2022 -> 20 Incidents so far

Edit: Using linear regression, the predicted total for the end of 2022 is 111 incidents.
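For anyone who wants to reproduce that figure, here is a minimal sketch. It assumes an ordinary least-squares line through the three full years only (the post does not say exactly how the fit was done), and the names `incidents` and `fit_line` are made up for illustration.

```python
# Incident counts per full year, taken from the post above.
incidents = {2019: 39, 2020: 67, 2021: 86}


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    xs = list(points.keys())
    ys = list(points.values())
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(incidents)
prediction_2022 = slope * 2022 + intercept

print(f"slope: {slope:.1f} incidents per year")
print(f"predicted 2022 total: {prediction_2022:.0f}")
```

With only three data points the fit is very rough, so the result is better read as a trend indicator than a forecast.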


I wondered if those error rates were proportional to GitHub's growth over time, so I looked it up. It seems that they had 40M users in 2019[1] and 73M users in 2021[2], which translates to 0.975 incidents per million users per year in 2019 compared to 1.178 in 2021.

So perhaps they are not exactly improving, but maybe there is some other way to normalize the data.

[1] https://github.blog/2019-11-06-the-state-of-the-octoverse-20...

[2] https://octoverse.github.com/
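The normalization above is easy to check. A small sketch, using the incident counts from the original post and the user counts quoted in this comment (the variable names are just illustrative):

```python
# Incident counts from the original post, user counts (in millions)
# as quoted in the comment above.
incidents = {2019: 39, 2021: 86}
users_millions = {2019: 40, 2021: 73}

# Incidents per million users for each year that has both numbers.
rates = {year: incidents[year] / users_millions[year] for year in incidents}

for year, rate in rates.items():
    print(f"{year}: {rate:.3f} incidents per million users")
```

The same dictionary approach would work for any other denominator (repositories, requests served, and so on) if those numbers were available.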

One would have thought that, once they were acquired by Microsoft, the number of incidents would go down, considering all the resources Microsoft could provide, but no.
GitHub has a lot more features now though. A few years ago you didn’t have GitHub Actions or workspaces; it was mostly a DDoS from Asia once in a while.
The number of incidents isn't so much of a problem as the amount of downtime is. That would be more interesting to see.
GitHub Availability Report [1]

Service Downtime Core Services Only - Cumulative per Month

(Some months had more than one outage)

Jan 2021: 3 hours 53 min

Feb 2021: 1 hour 42 min

Mar 2021: 4 hours 10 min

Apr 2021: 2 hours 20 min

May 2021: 10 hours 34 min

Jun 2021: 0 min

Jul 2021: 0 min

Aug 2021: 4 hours 23 min

Sep 2021: 0 min

Oct 2021: 1 hour 36 min

Nov 2021: 2 hours 50 min

Dec 2021: 0 min

Jan 2022: 26 min

Feb 2022: 13 min

[1] https://github.blog/tag/github-availability-report/

So, if my math is right (for 2021 only): 1888 min of downtime / 525,600 min in a year = 0.36% downtime, i.e. 99.64% uptime.
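The arithmetic can be checked with a short sketch that sums the 2021 monthly figures listed above and converts the total to an uptime percentage. It assumes a 365-day year and that the monthly numbers do not overlap; the names are illustrative.

```python
# Monthly downtime for 2021 as (hours, minutes), copied from the list above.
downtime_2021 = {
    "Jan": (3, 53), "Feb": (1, 42), "Mar": (4, 10), "Apr": (2, 20),
    "May": (10, 34), "Jun": (0, 0), "Jul": (0, 0), "Aug": (4, 23),
    "Sep": (0, 0), "Oct": (1, 36), "Nov": (2, 50), "Dec": (0, 0),
}

# Total downtime in minutes across the year.
total_minutes = sum(h * 60 + m for h, m in downtime_2021.values())

# Minutes in a non-leap year.
minutes_in_year = 365 * 24 * 60

downtime_fraction = total_minutes / minutes_in_year
uptime_percent = (1 - downtime_fraction) * 100

print(f"total downtime: {total_minutes} min")
print(f"uptime: {uptime_percent:.2f}%")
```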

If it was more like 99.80+ I think I would be like "meh", but honestly for the price you pay that's not terrible. Still, for a company at the Microsoft level, it should be 99.80 at least.

This is the same Microsoft that reboots laptops in the middle of Teams calls to do hour-long update cycles. >99% is implausibly good.
That's not the kind of progression you like to see: error rates increasing over time instead of decreasing.
Only if you believe those numbers mean anything. What are the errors for? GitHub has been adding lots of features and subproducts over the years, becoming a bigger and bigger platform as a result. What you want is the error rate per component, which may very well have gone down, with error spikes coming from "when GitHub adds a completely new feature and it goes through a slew of incidents in its first year". The bigger the feature, the more incidents.

Without more detailed numbers, there's literally no conclusion to draw here.

At every place I have ever worked, reported incidents going down would be good, not going up.
Every place I ever worked at understood that if you x3 the codebase/infra/interaction surface/etc, you can expect x3 errors. If the total number of errors doesn't go up as you grow you're doing amazing, and if it goes down even though you're landing more and more code for more and more features and subproducts, you have a genuine miracle.
These features can't be rolled out incrementally to users? In this day and age it seems weird for a web app to do a global go-live with something before testing it with a smaller group first.
A "smaller group" on GitHub's scale is still large enough to take down an entire sub product like actions, hooks, codegroups, etc.
Reasonable if usage and load are growing, too.
Based on the same extrapolation, GitHub will reach one incident per day by 2032.
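A rough check of that claim, assuming the same least-squares line through the 2019 to 2021 counts from the original post and taking "one incident per day" to mean 365 incidents per year (both are assumptions, and the names are illustrative):

```python
# Yearly incident counts from the original post.
years = [2019, 2020, 2021]
counts = [39, 67, 86]

# Least-squares line through the three points.
n = len(years)
mean_x = sum(years) / n
mean_y = sum(counts) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, counts)) / sum(
    (x - mean_x) ** 2 for x in years
)
intercept = mean_y - slope * mean_x

# Year at which the fitted line reaches 365 incidents per year.
target = 365
crossing_year = (target - intercept) / slope

print(f"fitted line reaches {target} incidents/year around {crossing_year:.1f}")
```

Under these assumptions the line crosses 365 between 2032 and 2033, so the comment's estimate is in the right neighborhood.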