| HN Mirror


by starttoaster 957 days ago

I actually disagree. I think the post mortem clearly lays out that disappointing things happened on the vendor's side _as well as_ internally. I don't think it's unfair to point out everything that happened in an event. I do think it would be unfair to ignore all the compounding issues that were in the vendor's power, and to just swallow all of the blame, when a huge reason businesses go through vendors at all is to have an entity responsible for a set of responsibilities that the business doesn't feel it has the expertise to handle itself. That implies a relationship built on trust, and it's fair to call out when trust is lost.

And even though Cloudflare did put some of the blame, as it were, on the vendor, the post mortem recognizes that Cloudflare wasn't doing its due diligence on the vendor's maintenance and upkeep to verify that the state of the vendor's equipment was the same as on the day they signed on. And that's ignoring a huge focus of the post mortem, where they admit guilt at not knowing, or not changing, the fact that Kafka and Clickhouse were only in that datacenter.

Furthermore, we do not know that Cloudflare didn't get the vendor's blessing to include that diagram in their post mortem. You're assuming they didn't. But for what it's worth, as someone who has worked in datacenters, none of this is all that proprietary. Their business isn't hurt because this came out. This is a fairly standard (and frankly simplified for business folk) diagram of how any decently engineered datacenter building would operate. There's no magic sauce in here that other datacenter companies are going to steal to put Flexential out of business. If you work for a datacenter company that doesn't already have any of this, you should write a check to Flexential or their electrical engineers for a consultancy.

And finally, the things that Cloudflare speculated on were things like, to paraphrase, "we know that a transformer failed, and we believe that its purpose was to step down the voltage that the utility company was running into the datacenter." Which, if you have basic electrical engineering knowledge, just makes sense. The utility company is delivering 12470 volts, so of course that needs to be stepped down somewhere along the way, probably multiple times, before it ends up coming through the 208 volt rack PDUs. I'm willing to accept that guess in the absence of facts from the vendor while they're still being tight-lipped.

However, that's not to say I'm totally satisfied by this post mortem either. I am also interested in hearing what decisions led to them leaving Kafka and Clickhouse in a state of non-redundancy (at least at the datacenter level) or how they could have not known about it. Detail was left out there, for sure.

1 comment

namibj 956 days ago

Generally, that isn't a voltage change where you'd use multiple transformers in sequence, let alone at the same site for the main/primary feed. To be clear, a redundant feed counts the same. It's more that some low-power applications (the "control plane" of the electrical switchyard) may use a lower voltage if one is conveniently available, even if that means a second transformation step between the generators/grid and the load.

That said, the existence of the 480V labeled intermediary does suggest they have a 277/480 V outside system, and a 120/208 V rack-side system.
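For anyone wondering where pairs like 277/480 and 120/208 come from: in a balanced three-phase wye system, the line-to-line voltage is sqrt(3) times the line-to-neutral voltage. Here is a minimal sketch of that arithmetic, assuming both systems are wye-connected (the thread doesn't confirm the actual configuration) and using a made-up helper name, `line_to_line`:

```python
import math


def line_to_line(v_line_to_neutral: float) -> float:
    """Line-to-line voltage of a balanced three-phase wye system.

    Hypothetical helper for illustration only; assumes a balanced
    wye (star) connection, which the thread does not confirm.
    """
    return math.sqrt(3) * v_line_to_neutral


# Nominal line-to-neutral voltages of the two systems mentioned above.
for v_ln in (277.0, 120.0):
    print(f"{v_ln:.0f} V line-to-neutral -> {line_to_line(v_ln):.1f} V line-to-line")
# 277 V line-to-neutral -> 479.8 V line-to-line
# 120 V line-to-neutral -> 207.8 V line-to-line
```

The results round to the nominal 480 V and 208 V figures, which is why a 480 V label on the diagram points to a 277/480 V system on one side and a 120/208 V system at the racks.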
