by aaronblohowiak 5022 days ago
If Github hasn't gotten their custom HA solution right, will you?

Digging into their fix, they disabled automatic failover -- so all DB failures will now require manual intervention. While this addresses the particular (erroneous) failover condition, it does raise the minimum downtime for true failures. Their MySQL replica's misconfiguration upon switching masters is also tied to their (stopgap) approach to preventing the hot failover. So, the second problem was due to a misuse/misunderstanding of maintenance-mode.

How is it possible that the slave could be pointed at the wrong master and have nobody notice for a day? What is the checklist to confirm that failover has occurred correctly?
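One item on such a checklist could be automated. The following is only a rough sketch, not anything GitHub is known to run: it assumes you have already fetched one row of MySQL's `SHOW SLAVE STATUS` output as a dict, and the function name `check_replica`, the host names, and the 60-second lag threshold are all made up for illustration.

```python
def check_replica(status, expected_master, max_lag_seconds=60):
    """Return a list of problems found in one SHOW SLAVE STATUS row.

    `status` is a dict keyed by the column names MySQL reports.
    An empty list means the replica looks correctly attached.
    """
    problems = []

    # The replica must be following the master we just failed over to.
    actual = status.get("Master_Host")
    if actual != expected_master:
        problems.append(f"replicating from {actual!r}, expected {expected_master!r}")

    # Both replication threads must be running.
    for thread in ("Slave_IO_Running", "Slave_SQL_Running"):
        if status.get(thread) != "Yes":
            problems.append(f"{thread} is {status.get(thread)!r}")

    # Seconds_Behind_Master is NULL (None here) when replication is not running.
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        problems.append("replication lag unknown (Seconds_Behind_Master is NULL)")
    elif lag > max_lag_seconds:
        problems.append(f"replica is {lag}s behind, limit is {max_lag_seconds}s")

    return problems


if __name__ == "__main__":
    # Hypothetical row: replica still pointed at the old master.
    row = {
        "Master_Host": "db-old.example",
        "Slave_IO_Running": "Yes",
        "Slave_SQL_Running": "Yes",
        "Seconds_Behind_Master": 0,
    }
    for problem in check_replica(row, expected_master="db-new.example"):
        print("FAILOVER CHECK FAILED:", problem)
```

Run right after a failover and again on a schedule, a check like this would at least page someone when the replica is following the wrong host, instead of leaving it unnoticed for a day.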

There is also a lesson to be learned in the fact that their status page had scaling issues due to db connection limits. Static files are the most dependable!

3 comments

It blows my mind that they aren't simply using Jekyll to generate and update the status page. I mean... they wrote it, right?

I think people tend to overestimate the value of nines to the user. It's chiefly a management/VC/busybody metric that has gained importance mainly due to it being a high level and easy to understand abstraction. "Well how much was it down?" Then they spend zillions on failover software, hardware and talent that could be supplanted by one fewer nine and a simpler architecture.

And really, just to get a dig in here, I believe Arrington shares a big part of the blame for this state of affairs with all of his Dvorak-caliber ignorant harping about Twitter back in the day.

"There is also a lesson to be learned in the fact that their status page had scaling issues due to db connection limits. Static files are the most dependable!"

Seriously, why would a status page need to query a db?

I assume that the status server is not actively checking every Github server/service whenever someone pings it. It probably polls the servers every X seconds. The best place to store that type of data is in a DB.

Where else would you put it?

> It probably polls the servers every X seconds.

And then you could write out a new static file, just once, and send it to your edge server of choice.

You could just as easily store the result in a plain file somewhere... a database seems like overkill.
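For what it's worth, the "poll every X seconds, write a static file" idea from the replies above fits in a few lines. This is a minimal sketch under my own assumptions, not how GitHub's status page works: the health checks are hypothetical callables returning true or false, and all function names and the output path are invented for the example.

```python
import html
import os
import tempfile
import time


def render_status(results, checked_at):
    """Build a tiny static HTML page from {service_name: is_up}."""
    rows = "\n".join(
        f"<li>{html.escape(name)}: {'up' if ok else 'DOWN'}</li>"
        for name, ok in sorted(results.items())
    )
    return (
        "<!doctype html>\n<title>Status</title>\n"
        f"<p>Last checked: {html.escape(checked_at)}</p>\n<ul>\n{rows}\n</ul>\n"
    )


def write_atomically(path, content):
    """Write to a temp file in the same directory, then rename over the target,
    so readers never see a half-written page."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(content)
        os.chmod(tmp, 0o644)  # mkstemp creates 0600; the web server must be able to read it
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise


def publish_once(checks, path):
    """Run every check once and publish the result as a static file.
    A check that raises is reported as down."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    checked_at = time.strftime("%Y-%m-%d %H:%M:%S UTC", time.gmtime())
    write_atomically(path, render_status(results, checked_at))
    return results
```

Call `publish_once(checks, "status.html")` in a loop with a `time.sleep(X)` between rounds, then copy the file to whatever edge server you like. No request from a visitor ever touches a database, so the page can't run out of db connections.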