| HN Mirror


by JsonDemWitOster 9 days ago

The Google SRE book offers the following as one of the reasons to not gun for 100% reliability (emphasis added):

> users typically don’t notice the difference between high reliability and extreme reliability in a service, because the user experience is dominated by less reliable components like the cellular network or the device they are working with. Put simply, a user on a 99% reliable smartphone cannot tell the difference between 99.99% and 99.999% service reliability!
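The arithmetic behind that quote can be sketched roughly as follows. This assumes the phone and the service sit in series and fail independently, which is a simplification; the function names are made up for illustration and are not from the SRE book.

```python
# Sketch: availability of components a request passes through in series,
# assuming their failures are independent (a simplifying assumption).
MINUTES_PER_YEAR = 365 * 24 * 60


def end_to_end(*availabilities: float) -> float:
    """Multiply per-component availabilities to get the end-to-end figure."""
    total = 1.0
    for availability in availabilities:
        total *= availability
    return total


def downtime_minutes_per_year(availability: float) -> float:
    """Expected unavailable minutes per (non-leap) year."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    phone = 0.99  # the "99% reliable smartphone" from the quote
    for service in (0.9999, 0.99999):
        combined = end_to_end(phone, service)
        print(
            f"service {service:.5f}: end-to-end {combined:.7f}, "
            f"service-only downtime {downtime_minutes_per_year(service):.2f} min/yr"
        )
```

Under those assumptions the extra nine moves the end-to-end figure from about 98.9901% to about 98.9990%, while the phone alone accounts for roughly 5256 minutes of unavailability per year.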

I've had a shaky relationship with my ISP of late. What brought me to this thread today is that I couldn't push to Github. Notably this isn't covered by their downtime report so, going by the available facts, it's _probably_ not Github's fault I couldn't push; I was also just on my daily stand-up call and kept getting disconnected.

But looking beyond today's available facts, odds are there's a bigger problem GH is not mentioning on their status page. They say the current incident has to do with "unauthorized users" and I wonder if pushing a commit from my IDE client counts as an operation from an "unauthorized user" as I still have to authorize with my SSH key.

It's just insane I can't decide which of Github or German o2 should be the more reliable service!

4 comments

ownagefool 9 days ago

Github isn't having a debate over how many 9s they have, they're having a zero 9s problem.

I think there are 3 big themes with this, though not

1. LLM tools have added considerable load.

2. LLMs used by developers to increase velocity seem to be leading to more outages. This calls the increased velocity into question.

3. Roadmaps focused on pushing work that doesn't address reliability problems, e.g. github moving to azure, or adding AI features.

All these same problems happen to orgs with other fads that aren't AI. Following fads is not good engineering.


Grombobulous 8 days ago

Your comment made me think: if GitHub was a Google product with similar popularity and scaling trajectory, would we see similar reliability issues?

Absolutely not. Google has reliability practices so deeply ingrained in their company they’re like an involuntary reflex.

This is a management issue.


PeterStuer 8 days ago

So they failed to manage growth. That is a business management problem first, and only a technical problem second. Yet Github management seems to constantly deflect to operations.

If you take on load (this is 100% by choice) beyond capacity, then obviously the system collapses.


rurban 8 days ago

Nope. It's entirely Azure management's fault. https://isolveproblems.substack.com/p/how-microsoft-vaporize...


ownagefool 5 days ago

Whilst that's crazy, github hasn't migrated to azure yet, so it's probably not exclusively an azure problem, and we've seen the same problems with amazon too (and several other large orgs).


rurban 5 days ago

Github thankfully not, but GH Actions are running entirely on Azure. They've set up Actions as a testbed for Azure, after all.


ownagefool 5 days ago

I actually don't have a problem with runners very often, but I must admit I use a combination of public and private that leans private. Are you seeing elevated errors on Actions?


kelseydh 9 days ago

Apparently Github is experiencing a huge increase in usage due to LLMs and this is the cause for a lot of their instability as of late.


PeterStuer 8 days ago

'Experiencing' makes it sound like they have no deliberate choice. No, they let this happen, by choice. They could have prevented this, contractually, by pricing, by governance, but chose not to.


IshKebab 9 days ago

> Put simply, a user on a 99% reliable smartphone cannot tell the difference between 99.99% and 99.999% service reliability!

Sure they can. If Google loads and Github doesn't, then it's clearly Github being down, not the mobile network.

Also not everyone uses a phone. My desktop & fibre internet have way better than 99% reliability.


spondyl 9 days ago

"unauthorized" is a bit different than "unauthenticated". The former suggests trying to access something you don't have permission for while the latter suggests you're just not logged in.
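For what it's worth, HTTP draws the same line, with slightly confusing naming: status 401 is called "Unauthorized" but is used when credentials are missing or invalid (i.e. unauthenticated), while 403 "Forbidden" is for a known user who lacks permission. A minimal sketch of that distinction, where `check_access` and `allowed_users` are hypothetical names, and which says nothing about how GitHub's status page actually uses the word:

```python
from http import HTTPStatus
from typing import Optional, Set


def check_access(user: Optional[str], allowed_users: Set[str]) -> HTTPStatus:
    """Hypothetical access check separating the two failure modes."""
    if user is None:
        # Not logged in at all: unauthenticated (HTTP names this "Unauthorized").
        return HTTPStatus.UNAUTHORIZED  # 401
    if user not in allowed_users:
        # Logged in, but lacking permission for this resource.
        return HTTPStatus.FORBIDDEN  # 403
    return HTTPStatus.OK  # 200


if __name__ == "__main__":
    allowed = {"alice"}
    for candidate in (None, "bob", "alice"):
        status = check_access(candidate, allowed)
        print(candidate, int(status), status.phrase)
```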

At a guess, I could imagine some sort of failure of cached pages, which can be cached for signed-out users but probably not for signed-in users (as the rendered HTML would need to have user context like their avatar etc.).
