| Is there a timeline for how long it took them to figure out Redis was down? Because having experienced the same, you get an alert. Cool. HAProxy says app servers are down. Ok. You SSH in and see that everything looks ok but the processes are bouncing. You tail the logs to find out why (obviously lots of these steps could be optimized). Within a few seconds you spot the error connecting to Redis. A minute later you've verified the Redis hosts are offline. That's the first 5 minutes after getting to a computer. After that it doesn't really matter why they're down. You fail over, get the site back up and worry about it later. Are these systems on a SAN? That's probably the first mistake if so. Redis isn't HA. You're not going to bounce its block devices over to another server in the event of a failure. That's just a complex, very expensive strategy that introduces a lot of novel ways to shoot yourself in the face. If you're hosting at your own data center, you use DAS with Redis. Cheaper, simpler. I've never seen an issue where a cabinet power loss caused a JBOD failure (I'm sure it happens, but it's far from a common scenario IME), but then again, locality matters. Don't get overly clever and spread logical systems across cabinets just because you can. I've been involved with this sort of thing more frequently than I'd like to admit, and while I don't know the exact situation here, 2h6m isn't necessarily anything to brag about without a lot more context. What's pretty shameful is that a company with GitHub's resources isn't drilling failover procedures, is ignoring physical segmentation as an availability target (or maybe just got really, really unlucky; stuff happens), and doesn't have a backup data center with BGP or DNS failover. This is all stuff that (in theory, if not always in practice) many of their clients wearing a "PCI Compliant" badge are already doing on their own systems. |
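For what it's worth, the "verify the Redis hosts are offline" step from the comment above can be scripted so it takes seconds instead of a minute. This is only a rough sketch using the Python standard library: the function names `probe_redis` and `triage` are made up for illustration, and it assumes Redis is on the default port 6379 with no auth or TLS in front of it. It relies on Redis accepting an inline `PING` and answering `+PONG`.

```python
import socket


def probe_redis(host, port=6379, timeout=1.0):
    """Classify one Redis host as 'up', 'unexpected-reply', or 'unreachable'.

    Hypothetical helper: assumes no AUTH and no TLS on the Redis port.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            # Redis accepts inline commands, so a bare PING is enough.
            sock.sendall(b"PING\r\n")
            reply = sock.recv(64)
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return "unreachable"
    # A healthy server answers with the simple string +PONG.
    return "up" if reply.startswith(b"+PONG") else "unexpected-reply"


def triage(hosts):
    """Probe a list of (host, port) pairs and return {'host:port': status}."""
    return {f"{host}:{port}": probe_redis(host, port) for host, port in hosts}
```

You'd call something like `triage([("redis-a.internal", 6379), ("redis-b.internal", 6379)])` (hostnames invented here) from the alert runbook. An `unexpected-reply` usually means something answered on the port but not with `+PONG`, for example an auth error, which is a different problem from the host being offline.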
You bet they busted their ass to get this fixed and shared their learnings with us. I'm extremely grateful for this, and yeah, it inconvenienced my morning, but nothing more.
You make it sound so easy. If it takes the GitHub folks 2 hours, I can bet it would've taken us much longer.