by xrayarx 1155 days ago
Water intrusion in europe-west9-a has caused a multi-cluster failure and has led to an emergency shutdown of multiple zones. We expect general unavailability of the europe-west9 region.

https://twitter.com/GCP_Incidents


But europe-west9-a is only one zone, why does the whole region fall over as a consequence?
GCP has multiple zones in the same physical building. Not all cloud providers have distinct physical buildings for each Availability Zone.
Do they have an official description of what a zone is somewhere?

Back in the day, when we had our own data centers, a zone was defined as a "fire section", meaning that it should not be impacted if any other zone of the data center had a fire. This obviously means that you can't call 3 floors of a building a zone.

Edit: The information on this site https://cloud.google.com/docs/geography-and-regions#regions_... clearly states that a zone is "physically distinct" so they have some explaining to do.

Edit 2: Sneaky... They changed the status page to say "europe-west9" instead of "europe-west9-a".

I could not find the GCP equivalent to this from AWS:

"AZs are physically separated by a meaningful distance, many kilometers, from any other AZ, although all are within 100 km (60 miles) of each other."

https://aws.amazon.com/about-aws/global-infrastructure/regio...

Physically distinct could refer to distinct hardware in the same building and cage space. It’s “physically distinct”. Google makes no promises that the zones are in different buildings or separated by N feet/miles of space.
AFAIK all zones (a, b, and c) have been reported to be down. I'd love to understand what happened.
A cooling pipe started leaking and set the batteries on fire.
Probably some dependencies they did not plan for
Switching off all of 1 zone and checking the others aren't impacted is literally step 1 of checking your organisation is truly zonally redundant...

Someone as big as Google ought to have been practicing this automatically every week in a staging environment, and probably at least annually in production.

How large is the flood? How far away are these zones?
It happened at GlobalSwitch Clichy, near Paris. From what I gathered from a French forum[1], it started with a flood and then a fire. No rooms have been affected, apparently.

[1]: https://lafibre.info/datacenter/incendie-maitrise-globalswit...

A cooling pump failure led to water leaking into the UPS room. Batteries caught fire and firefighters can't access the room. The fire is contained, though.

(This was at ~10-11 am GMT+2.)

Edit:

Fire is extinguished (~3pm GMT+2)

https://www.mail-archive.com/frnog@frnog.org/msg72320.html

https://www.mail-archive.com/frnog@frnog.org/msg72323.html

https://www.mail-archive.com/frnog@frnog.org/msg72327.html

I'm getting horrible flashbacks of the OVH DCs those many years ago.
THAT was many years ago? Felt like yesterday
A bit more than 2 years: https://www.datacenterdynamics.com/en/news/fire-destroys-ovh...

I thought it was less than 12 months...

It does feel like yesterday, but yeah, it was a couple of years back!
We were impacted at a previous company. Luckily we had solid backups, so everything was back online a few hours later.

Still, it was kinda fun to go to work and learn that the corporate website literally went up in flames.

What a disaster. A datacenter made out of wood, what could go wrong ...
If it's the one in Clichy I'm thinking of, it's dug into the embankment that lines a railway basin, so... yeah, floods suck.
It is the Clichy one. It's not that dug in; where did you get that from? (I used to work there circa 2010.) I think the water retention made its way to the battery rooms. There have been no floods (or rain) in Paris (or most of France) lately.