by vorador
3797 days ago
I have no doubt the people at GitHub have spent a lot of time thinking about multi-region failover. You never hear about the successful failovers, only the ones that cause outages. To quote a famous US politician: "There are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don't know we don't know." You can't plan a failover for a failure you didn't predict.
|
1. Degraded performance that might be a fault justifying fail-over. A human in the loop is a must here, as complex services can act weird under load or at random.
2. Corrupted data or packets coming in that might indicate a failure. You might fail over automatically here.
3. No data coming in at all for 5-10 seconds, especially on a dedicated line. Fail over automatically here, as nothing sending data is already the definition of downtime and probably indicates a huge failure.
Companies should also do plenty of practice fail-overs at various layers of the stack during non-critical hours to ensure the mechanisms work. In GitHub's case, number 3 should have applied, and solutions dating as far back as the 1980s would have kicked in automatically within seconds to minutes. Their tech or DR setup must simply not be capable of that. There could be good financial reasons for that, but not technological ones.
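
For what it's worth, the three numbered cases can be sketched as a small decision function. This is only an illustration under stated assumptions: the `LinkHealth` fields, the `decide` function, and every threshold except the 5-second silence limit (the low end of the 5-10 seconds in case 3) are hypothetical, and a real setup would tune them per service.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    PAGE_HUMAN = "page_human"        # case 1: a person decides
    AUTO_FAILOVER = "auto_failover"  # cases 2 and 3: no human needed


@dataclass
class LinkHealth:
    """Hypothetical health snapshot for one site or link."""
    seconds_since_last_data: float  # time since anything arrived
    corrupt_fraction: float         # share of recent packets failing checks
    latency_ratio: float            # current latency / normal baseline


# Assumed thresholds, for illustration only.
SILENCE_LIMIT_S = 5.0   # low end of the "5-10 seconds" in case 3
CORRUPT_LIMIT = 0.05    # hypothetical
LATENCY_LIMIT = 3.0     # hypothetical


def decide(health: LinkHealth) -> Action:
    # Case 3: total silence is already downtime, so fail over automatically.
    if health.seconds_since_last_data >= SILENCE_LIMIT_S:
        return Action.AUTO_FAILOVER
    # Case 2: corrupted data may indicate a failure; this sketch chooses
    # to fail over automatically, which the comment leaves optional.
    if health.corrupt_fraction >= CORRUPT_LIMIT:
        return Action.AUTO_FAILOVER
    # Case 1: degraded performance is ambiguous, so keep a human in the loop.
    if health.latency_ratio >= LATENCY_LIMIT:
        return Action.PAGE_HUMAN
    return Action.NONE
```

The checks run from the least ambiguous signal to the most ambiguous one, so a silent link never waits on a human just because it also looked slow beforehand.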