by nickpsecurity 3793 days ago

Except you can predict it. Your fail-over mechanism needs to be able to detect these things:

1. Degraded performance that might be a fault justifying fail-over. A human in the loop is a must here as complex services can just act weird under load or randomly.

2. Corrupted data or packets coming in that might indicate a failure. Automatic fail-over might be appropriate here.

3. No data coming in at all for 5-10 seconds, especially on a dedicated line. Fail over automatically here, since nothing sending data is already the definition of downtime and probably indicates a huge failure.
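
To make the three cases concrete, here's a rough sketch of the decision logic in Python. Everything in it is made up for illustration: the `LinkSample` fields, the `decide` function, and the thresholds (5 seconds of silence, 5% corrupt packets, 500 ms latency) are assumptions, not anything from a real product. Whether corruption triggers an automatic fail-over is left as a flag, since that one is a judgment call.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    ALERT_HUMAN = "alert_human"   # case 1: a person decides
    FAILOVER = "failover"         # cases 2 and 3: automatic


@dataclass
class LinkSample:
    """One health observation of the primary link (hypothetical fields)."""
    seconds_since_last_data: float
    corrupt_fraction: float       # share of recent packets failing integrity checks
    latency_ms: float


# Illustrative thresholds only; real values depend on the service.
SILENCE_LIMIT_S = 5.0
CORRUPT_LIMIT = 0.05
LATENCY_LIMIT_MS = 500.0


def decide(sample: LinkSample, auto_failover_on_corruption: bool = True) -> Action:
    # Case 3: total silence is already downtime, so fail over automatically.
    if sample.seconds_since_last_data >= SILENCE_LIMIT_S:
        return Action.FAILOVER
    # Case 2: corrupted traffic might justify automatic fail-over.
    if sample.corrupt_fraction >= CORRUPT_LIMIT:
        return Action.FAILOVER if auto_failover_on_corruption else Action.ALERT_HUMAN
    # Case 1: degraded performance is ambiguous, so put a human in the loop.
    if sample.latency_ms >= LATENCY_LIMIT_MS:
        return Action.ALERT_HUMAN
    return Action.NONE
```

The checks are ordered from least to most ambiguous, so a silent link never waits on a human even if the last latency reading also looked bad.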

Companies should also do plenty of practice fail-overs at various layers of the stack during non-critical hours to ensure the mechanisms work. In GitHub's case, number 3 should've applied, and solutions dating as far back as the '80s would have kicked in automatically within seconds to minutes. Their tech or DR setup must just not be capable of that. There could be good financial reasons for that, but not technological ones.