Cloudflare Dashboard and API Outage on April 15, 2020


	Cloudflare Dashboard and API Outage on April 15, 2020 (blog.cloudflare.com)
	32 points by jplevine 2254 days ago

6 comments

atonse 2254 days ago

This is not a critique. Cloudflare is clearly a solid, well-engineered system given its scale; just look at their other post-mortems.

But it's kind of interesting: you can have all the redundant systems and smart software, and some dude can still accidentally pull cables. Oh, humans!

Would love to see what mitigations they came up with other than the ones listed (apart from probably putting 20 BRIGHT RED labels next to the patch panels saying DO NOT DISCONNECT, EVER EVER EVER!).

Perhaps one mitigation could be a better way to literally identify who's there and call them up within seconds and ask what they just did?


bogomipz 2254 days ago

>"Documentation: After the cables were removed from the patch panel, we lost valuable time identifying for data center technicians the critical cables providing external connectivity to be restored. We should take steps to ensure the various cables and panels are labeled for quick identification by anyone working to remediate the problem. This should expedite our ability to access the needed documentation."

So they failed to label their cables? I'm sorry but this is "datacenter 101" stuff. How are none of the cables plugged into your patch panels labeled? Every colo has a label gun you can borrow! Also remote hands will gladly send you a pic of a rack or cabinet to verify what they're looking at.
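
To make the labeling point concrete, here is a minimal sketch of the kind of cable record that would let a remote technician see which cables to restore first. The label scheme, panel names, and fields are all hypothetical and are not taken from Cloudflare's post-mortem.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cable:
    label: str      # text printed on the physical label (hypothetical scheme)
    panel: str      # patch panel identifier (hypothetical)
    port: int       # port number on that panel
    purpose: str    # human-readable description for remote hands
    critical: bool  # True if the cable carries external connectivity


# Illustrative inventory only; a real one would live in a source of truth
# such as a DCIM tool rather than in code.
INVENTORY = [
    Cable("INT-07", "PP-A", 7, "internal rack interconnect", False),
    Cable("EXT-02", "PP-A", 2, "external transit uplink", True),
    Cable("EXT-01", "PP-A", 1, "external transit uplink", True),
    Cable("EXT-03", "PP-B", 1, "external peering link", True),
]


def restore_order(inventory, panel):
    """Return the cables on one panel, critical ones first, then by port."""
    on_panel = [c for c in inventory if c.panel == panel]
    return sorted(on_panel, key=lambda c: (not c.critical, c.port))


if __name__ == "__main__":
    for c in restore_order(INVENTORY, "PP-A"):
        flag = "CRITICAL" if c.critical else "normal"
        print(f"{c.label} panel={c.panel} port={c.port} [{flag}] {c.purpose}")
```

The same list could be printed and taped inside the cabinet, so the labels on the cables and the documentation agree.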


ahofmann 2254 days ago

Why is this post being voted down? It is extremely impressive what Cloudflare has done since its founding. My company has been a customer since 2011 because of me, and yet Cloudflare looks like a nice shiny shell with the same total chaos underneath as in any small IT company I've ever had the chance to look into. Unfortunately, this doesn't let me sleep well when my company is dependent on Cloudflare. We therefore use hardly any Cloudflare features, so that we can switch to our own infrastructure at any time. As annoying as Google and Microsoft are, I can sleep better because they have their processes better under control (I know that these companies offer different products, but the question of dependency remains the same).


bogomipz 2254 days ago

There seems to be an almost unspoken rule on HN that you don't say anything critical about Cloudflare. It will almost universally be downvoted even if it's done respectfully. You will even see people fawn over their post-mortems yet still not be critical or express disappointment at being affected by the actual outage. I am not sure why this is. It is very peculiar though. It's stranger given that Cloudflare seemingly uses HN as their primary marketing channel to generate discussion about the company but somehow that discussion should only ever be positive commentary.


idrism 2254 days ago

It’s strange to me that their remediation did not include distributing these systems to be redundant across multiple datacenters, maybe with a globally distributed database.

> we knew that the failback from disaster recovery would be very complex

The disaster recovery failover to a second data center (and the failback) should not force a choice about whether to fail over. They should be able to fail over immediately, and the system should self-heal once the original data center is back online.


cookiecaper 2254 days ago

I'll just leave this here ... https://github.com/netbox-community/netbox


rkwasny 2254 days ago

In summary, 10% of internet traffic relies on one patch panel somewhere :)


atonse 2254 days ago

The cloud is just someone else's computer right? :-)

Submarine cables are like this too. It all comes down to a quarter-inch-thick bundle of fibers, each thinner than a human hair.


majjaa 2254 days ago

Is it just me, or does it feel like these post-mortem blog posts are becoming extremely common with Cloudflare?
