Hacker News
by ownagefool 16 days ago
GitHub isn't having a debate over how many 9s they have; they're having a zero 9s problem.

I think there are 3 big themes with this, though not

1. LLM tools have added considerable load.

2. LLMs used by developers to increase velocity seem to be leading to more outages, which calls the increased velocity into question.

3. Roadmaps are focused on pushing features that don't address reliability problems, e.g. GitHub moving to Azure or adding AI features.

All these same problems happen to orgs with other fads that aren't AI. Following fads is not good engineering.

3 comments

Your comment made me think: if GitHub was a Google product with similar popularity and scaling trajectory, would we see similar reliability issues?

Absolutely not. Google has reliability practices so deeply ingrained in their company they’re like an involuntary reflex.

This is a management issue.

So they failed to manage growth. That is a business management problem first, and only a technical problem second. Yet GitHub management seems to constantly deflect to operations.

If you take on load (this is 100% by choice) beyond capacity, then obviously the system collapses.

Nope. It's entirely Azure management's fault. https://isolveproblems.substack.com/p/how-microsoft-vaporize...
Whilst that's crazy, GitHub hasn't migrated to Azure yet, so it's probably not exclusively an Azure problem, and we've seen the same problems with Amazon too (and several other large orgs).
GitHub itself thankfully not, but GH Actions is running entirely on Azure. They've set up Actions as a testbed for Azure, after all.
I actually don't have a problem with runners very often, but I must admit I use a combination of public and private that leans private. Are you seeing elevated errors on Actions?