Hacker News
by tinco 1155 days ago
So this is probably too soon, thoughts and prayers for the datacenter operators and staff out there, but are they going to auction off the flooded hardware? Trying to restore a flooded Google rack sounds like a super fun project.

Anyone have experience with losing an entire DC to flooding?

edit: I just Googled it (lol) and this DC has to be brand spanking new (https://cloud.google.com/blog/products/infrastructure/google...), apparently they just opened it last June. Google must be livid with the contractors who built the place for it to get flooded so soon.

10 comments

2015 Chennai (South India) Floods. It was the flood of a century. [1]

Our DC was intact, but the building and its access were cut off. We lost the backup diesel power generators in the flooding. Of course, grid power was cut off too.

Our DC operating team managed to shut down all the servers and racks cleanly before UPS power was completely drained. The 4 engineers and 2 security guards then swam out of the compound in chest-high water. (I am not kidding.)

When the rains subsided and the flood waters receded after a couple of days, we had to plan the restart. The facility still had to be certified by health and safety, but we needed to get the datacenter back up.

A secondary operations site that would remote-connect to the DC was brought up within 1 week, since we expected the rains might continue for a few more days and cause interruptions. But the critical item for the plan to work was getting a new backup power setup. We rolled in a truck-mounted diesel generator, positioned it at the highest point on the campus (also close to our building tower that had the DC), and ran power cables to it (we had to source this, and it was a challenge to do it with the time crunch and the rains).

We moved staff to other cities by bus (the airport was shut down) as part of our recovery plan, but we still needed connectivity to our DC for some of the critical processes.

Long story short, it worked.

I'll never forget the experience and the scars from this war story.

[1]: https://en.wikipedia.org/wiki/2015_South_India_floods

Ha, you bring back old memories. We had the largest compute footprint in India at that time in Ambattur (a Chennai industrial suburb). The DC in question was a multi-story building, the ground floor itself was several feet above road level, and there was a huge natural lake in front. Luckily, heavy rains only caused havoc to the roadside storm drains and road traffic. And we had more than 250K liters of diesel to last us more than 24 hours, and we had several tankers on standby. So we didn't have to shut down anything. Funny thing is, we had selected this site less than a year earlier and had discussed the 100-year flood lines and worst-case probabilities of heavy rains, flooding, etc. Being well prepared really paid off.

Yes. It was a miracle that Ambattur did not suffer as much, given its proximity to the Redhills lake reservoir. Had the Water Resource Department also opened the sluice gates of the Redhills reservoir, as it did at Chembarambakkam lake during the floods and incessant rains, the situation would have been different. Given that Ambattur was accessible and relatively unaffected, that was the location where we brought up our alternate operating site within a week.

In any case, it is good you didn't have to go through a DC recovery during one of the worst disasters of the 21st century.

The question I keep asking in all DR planning sessions and tabletop exercises is: what would we do in a situation like Fukushima or Chennai 2015? In both cases, flooding caused the backup power generators to fail. Also, what do we do when we have all or some of our resources but are faced with a denial-of-premises situation (which is what I faced)?

I once was a customer of a DC whose roof drainage was clogged, turning the roof into a lake after a couple of rain storms. It then proceeded to rain inside the DC as the roof started to leak from all the pressure.

"Servers are down, I'll head over to the DC" turned into "Um... it's raining _in the DC_. Get me some tarps and get us cut over to the backup in the office".

Ah, the glory days of running out of a single co-lo across the parking lot with our "backup site" being a former broom closet.

As someone who has owned two commercial flat-roof buildings, I can't stress enough that you MUST do inspections of your roof at least twice a year. Especially if you live in a big city. I've had backups caused by kids roofing balls and bottles, a stolen purse, a dead squirrel, and dirty balled-up diapers from the neighboring apartment building. City living for ya.

Yeah, I'm pretty sure in this case it was a combination of having a 4ft parapet around the entire roof and having basically never done an inspection. Not enough drains, and they were all full of leaf matter.

Many years ago, I managed a server room with dedicated cooling on the 4th floor of a 4-story building with a flat roof. One night the temp alarms went off, and when I showed up water was dripping off my overhead Liebert unit and onto the racks.

And it wasn't even raining outside! So I grabbed some plastic to cover the racks and phoned in emergency portable cooling as the room's AC started failing.

It turns out earlier that day, a technician performing seasonal maintenance on a boiler tank on the roof had drained the tank and refilled it. But instead of directing the water out into a proper drain, he sent it down a convenient pipe that was actually a vent from our ceiling into the boiler house. The boiler was dozens of meters from my server room, but the water followed the old steel and plaster ceiling remnants over to my computers.

And this boiler water was more exciting than rain: it came with all the dissolved minerals, metals, and preservatives computers crave! I didn't lose any computers in the racks, but it killed the Liebert's control board.

The machines are not industry-standard stuff, and they don't auction them; they destroy them for customer security. See here: https://www.datacenterknowledge.com/google-alphabet/robots-n...

Just the drives are destroyed. The servers themselves end up in all sorts of spots:

https://www.ebay.com/b/Google-Server/11211/bn_7023306662

Those are all Google search appliances; Google sold those. They're not operated by Google themselves.

I'm not sure what the disk encryption story is in Google Cloud, but I'd rather it didn't end up on eBay. Mind you, "flooded" covers a wide range of possibilities, and a surprisingly small amount of water ingress would trip a breaker while leaving the racks in good order.

All data is encrypted at rest, and all hard drives are destroyed on site.

> a surprisingly small amount of water ingress would trip a breaker while leaving the racks in good order.

If that were the case they wouldn't be saying "There is no current ETA for recovery," and "it is expected to be an extended outage. Customers are advised to failover to other regions."

There's a lot more to a datacenter building than just the servers sitting on racks. In particular, here there was a fire in the power-serving infrastructure (caused by the flood, presumably). So nearly all of those servers could be totally fine, just off, but if the power distribution network in the building is literally fried, that's gonna take a long time to fix.

Starting up a cloud region after a total shutdown is likely an untested procedure with no well-known timeframe, even if the hardware is OK.

If you're in the business of being a massive cloud provider, hopefully restarting a region is not an untested procedure for you.

You could always test this in a live environment before a region becomes open to customers.

"Test in a live environment before the region becomes open to customers" is a test that's not entirely representative of "the region had an emergency shutdown with customers on it." And the latter is obviously something you can't reliably test, unless you decide to crash a whole region under live traffic.

I'm sure they have checklists and procedures, but an unknowable laundry list of things will go wrong.

You're right. It's not untested at all. It's just not instantaneous, unfortunately. :)

Having (for example) 6 inches of water in your 115kV switch room is a small-scale problem that can cause a large-scale outage.

Better than when Planet's DC actually exploded [1].

Restoration is hard when health and safety are in question. Good luck to these ops folks <3

[1] https://www.datacenterknowledge.com/archives/2008/06/01/expl...

A long time ago, one server room of SPB-IX (located in the basement of a university building) was flooded. It was a fun day for the engineers, who unplugged the surviving equipment while standing knee-deep in water.

It was before the dam [1] was built, when floods were a huge problem in SPB.

[1]: https://en.wikipedia.org/wiki/Saint_Petersburg_Dam

Umm, thoughts and prayers? It's not as if their house is being washed away :) They just have a busy day at work. Keeps things exciting :P

I doubt they would let anyone have access to their hardware. There is a ton of proprietary stuff in there.

> but are they going to auction off the flooded hardware?

I wonder how many inches/feet we're talking here? The hardware at the top (unless it experienced an electrical short) is most likely fine?

Likely not. It's also not Google's first DC flood/water intrusion causing a GCP incident.