

	by eduardogarza 140 days ago
	Can someone that's worked at one of these big companies honestly explain how it happens that when these guys are down, it's never for like 10-15 mins ... it's always 1-2+ hours? Do they not have mechanisms in place to revert their migrations and deployments? What goes on behind the scenes during these "outages"?

5 comments

aix1 140 days ago

Part of it is observability bias: longer, more widespread outages are more likely to draw significant attention. This doesn't mean that there aren't also shorter, smaller-scope outages; it's just that we're much less likely to know about them.

For example, if there's a problem that gets caught at the 1% stage of a staged rollout, we're probably not going to find ourselves discussing it on HN.
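To make the staged-rollout point concrete, here is a minimal Python sketch. It is not any particular company's tooling: the stage fractions, the `error_rate_at` callback, and the 1% error threshold are all assumptions invented for illustration.

```python
from typing import Callable, Sequence, Tuple


def staged_rollout(
    stages: Sequence[float],
    error_rate_at: Callable[[float], float],
    max_error_rate: float = 0.01,
) -> Tuple[bool, float]:
    """Walk through rollout stages, halting at the first unhealthy one.

    stages: fractions of traffic to expose, e.g. [0.01, 0.1, 0.5, 1.0].
    error_rate_at: hypothetical health probe; given the current fraction,
        it returns the observed error rate for the new build.
    Returns (completed, last_fraction_reached).
    """
    reached = 0.0
    for fraction in stages:
        reached = fraction
        if error_rate_at(fraction) > max_error_rate:
            # Stop here; a real system would roll back at this point.
            return False, reached
    return True, reached


if __name__ == "__main__":
    # A build that fails everywhere is stopped at the first (1%) stage.
    print(staged_rollout([0.01, 0.1, 0.5, 1.0], lambda fraction: 0.2))
```

A bad build stopped at the first stage only ever touches 1% of traffic, which is why that kind of incident rarely becomes public.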


jcfrei 140 days ago

Quick fixes tend to break other stuff and just make matters worse. Better to leave it offline for a little longer, fix the actual root issue, and make sure it comes back online cleanly. If the issue was just a quirk in a recent deployment, it can probably be reverted easily on the endpoints where it was just deployed (I'm sure they are using staggered roll-outs). These long outages are probably not related to a recent release.


Ocerge 140 days ago

You will run into thundering herd/hotspotting/pre-warmed caching issues when you have to restart. There's generally not an easy way to switch these sorts of systems on and off, especially a relatively new system that isn't battle-hardened.
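One common mitigation for the thundering-herd part is to have clients retry with exponential backoff plus random jitter, so they don't all reconnect at the same instant when the service comes back. A minimal Python sketch follows; the function name, the 0.5 s base, and the 60 s cap are assumptions for illustration, not anything the companies in question are known to use.

```python
import random
from typing import List, Optional


def backoff_delays(
    attempts: int,
    base: float = 0.5,
    cap: float = 60.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return one randomized retry delay (in seconds) per attempt.

    Uses "full jitter": each delay is drawn uniformly from
    [0, min(cap, base * 2**n)], so retries from many clients spread out
    instead of arriving in synchronized waves.
    """
    if rng is None:
        rng = random.Random()
    return [rng.uniform(0.0, min(cap, base * 2 ** n)) for n in range(attempts)]


if __name__ == "__main__":
    # Each client would sleep for these delays between successive retries.
    for delay in backoff_delays(5):
        print(f"retry after {delay:.2f}s")
```

This only addresses reconnect storms; cold caches and hotspotting still need a gradual ramp-up on the server side.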

I got nothing for the GitHub outages this year though; that seems like incompetence.


dconsorte 138 days ago

They probably use every ounce of compute available, all at once, because demand > supply. So, there's no fallback mechanism.


mrguyorama 140 days ago

Well when the coding agents go down who are they supposed to ask what the problem is?

They should probably buy subscriptions to those Chinese agents.
