What manner of failure would cause such globally deployed and distributed systems to go down like this? I'm very interested to read up on this when they release details of the failure.
I work for a smaller but comparably large platform. "If everything is down check the DB" is at the top of one of our internal monitoring websites in red.
Screw ups related to data loss are rare (I've been here years and haven't seen one with the DBs that the stuff I work with uses) but failures at this scale tend to cascade a little ways and it takes time to dig out of the hole. They probably have the problem solved but they have to spend a bunch of time synchronizing things and verifying the fix before they press the big red "go live" button.
We have a different dedicated page that gives an overviews of what's going on with the DB. The page in question is supposed to be a single stop that lets you visually get an overview of the state of the application servers and whether things are "normal" and if not allow you to quickly identify what is not normal.
I have no inside knowledge of this one, but broadly speaking, these sorts of failures can be caused by a change thought innocent at the time to the core software that is then widely deployed using automated systems. If the core's tests didn't catch a real issue in production (and for whatever reason, the rollout happens faster than the regular small-release verification process can catch the error), things can go sour in a way that's expensive to un-sour.
Amazon once pushed a seemingly-innocuous change to their internal DNS that caused all the routers between and within datacenters to drop their IP tables on the floor. They had to re-establish the entire network by hand---datacenter heads calling each other up and reading IP address ranges over the phone to be hand-entered into lookup tables. Cost a fortune in lost sales for the time the whole site was inaccessible.
As someone who works at a large company in the networking space, you would be surprised that minor changes to configuration can cause catastrophic failures that are really challenging to come back from
Network failures are usually really bad when your system is globally deployed and distributed -- often times you can't even communicate with your machines to deliver fixes :p
Source: Me. My career has been spent managing db's for internet scale sites.