by fixermark
2659 days ago
I have no inside knowledge of this one, but broadly speaking, these sorts of failures can be caused by a change to the core software that was thought innocent at the time and is then widely deployed using automated systems. If the core's tests don't catch an issue that shows up in production (and, for whatever reason, the rollout happens faster than the regular small-release verification process can catch the error), things can go sour in a way that's expensive to un-sour. Amazon once pushed a seemingly innocuous change to their internal DNS that caused all the routers between and within datacenters to drop their IP tables on the floor. They had to re-establish the entire network by hand: datacenter heads calling each other up and reading IP address ranges over the phone to be hand-entered into lookup tables. It cost a fortune in lost sales for the time the whole site was inaccessible.