HN Mirror


	by kyledrake 2455 days ago
	What unholy thing did they do that broke it across 12 different datacenters, good lord.

6 comments

alexeldeib 2455 days ago

This does seem to indicate a notable lack of isolation for the blast radius between DO datacenters. Would be interesting to see the post mortem.

protomyth 2455 days ago

I get the feeling that whoever writes the post-mortem is going to have a bit of pressure to assure folks that there is isolation going forward.

klodolph 2455 days ago

That would be a bad sign that there’s something wrong with the culture. I would hope for a postmortem that identified flaws that genuinely needed to be fixed.

viraptor 2455 days ago

Those are not mutually exclusive and actually a good idea. You want to fix this specific issue, but also ensure that whatever process took down one DC doesn't affect other DCs. That's scaling and redundancy 101 - not sure why it would be something wrong.

klodolph 2455 days ago

> Those are not mutually exclusive and actually a good idea.

The goals “assuring folks that there is isolation” and “identifying flaws that need to be fixed” are somewhat contrary to each other.

The post-mortem should identify flaws in systems, processes, and thinking. It should not try to assure people that there is isolation when there is evidence to the contrary.

> You want to fix this specific issue, but also ensure that whatever process took down one DC doesn't affect other DCs.

This was a multi-regional failure. So, this specific issue is also an isolation problem, among other things. You will want to ensure that this problem doesn’t happen again but you shouldn’t assure that it won’t.

protomyth 2455 days ago

I would think having all zones go down is a flaw that genuinely needed to be fixed.

klodolph 2455 days ago

That’s not the flaw, that’s the outcome. The purpose of a post-mortem is to identify the flaws that caused that outcome, and ways to fix those flaws.

swsieber 2455 days ago

Whoever broke it is going to feel significant pressure to actually isolate things too.

notyourday 2455 days ago

Propagating mistakes across all the things is devops

jbarham 2455 days ago

DevOps Borat: "To make error is human. To propagate error to all server in automatic way is #devops."

https://twitter.com/DEVOPS_BORAT/status/41587168870797312

nodesocket 2455 days ago

Google Cloud recently had a global outage. DevOps tools that interact with all resources across data centers are primarily the culprit.

sterlind 2455 days ago

Google's RCA is here, I believe: https://status.cloud.google.com/incident/cloud-networking/19...

Reading between the lines, it looks like their maintenance system needed to take down several Borg clusters within a single AZ, and their BGP route reflectors all ran from the same set of logical clusters. They'd tried to set up geo-redundancy by having different BGP speakers across different AZs, but they were all parented by the same set of logical clusters, and the maintenance engine descheduled all of them together. Then the network ran okay ("designed to fail static for short periods of time") until the routes expired, after which routes got withdrawn and traffic blackholed.

They realized the issue within an hour. Unfortunately, since they took down multiple replicas of their network control plane, they lost Paxos primary and had to rebuild configuration.

(Disclaimer: I work in Azure, I just find it fascinating to look at Google's RCAs because failure provides an insight into their architecture and risk engineering.)
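
To make the "fail static until the routes expire" behaviour described in this comment concrete, here is a minimal sketch. It is not Google's implementation: the `Route` type, the `ttl` field, and the `usable_routes` function are hypothetical names chosen for illustration, and the 600-second hold time is an arbitrary assumption.

```python
from dataclasses import dataclass


@dataclass
class Route:
    prefix: str
    learned_at: float  # time the route was last refreshed, in seconds
    ttl: float         # hypothetical hold time before the route expires


def usable_routes(routes, now, control_plane_up):
    """Return the prefixes that still forward traffic at time `now`."""
    if control_plane_up:
        # A healthy control plane keeps refreshing routes, so none expire.
        return [r.prefix for r in routes]
    # Fail static: keep the last known routes, but only until they expire.
    return [r.prefix for r in routes if now - r.learned_at < r.ttl]


routes = [Route("10.0.0.0/8", learned_at=0.0, ttl=600.0)]

# Control plane down, route still within its hold time: traffic flows.
print(usable_routes(routes, now=300.0, control_plane_up=False))  # ['10.0.0.0/8']

# Control plane still down after expiry: route withdrawn, traffic blackholed.
print(usable_routes(routes, now=900.0, control_plane_up=False))  # []
```

The point of the sketch is that a fail-static design only buys time: if the control plane is not restored before the hold time runs out, the data plane loses its routes anyway.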

temikus 2453 days ago

Or just bad rollout procedures.

alexeldeib 2452 days ago

potato, potato ;)

bluedino 2455 days ago

Probably the old "one command ran on everything"

astrodust 2455 days ago

tmux is a dangerous tool in the wrong hands.

rubbingalcohol 2455 days ago

to be fair, it's dangerous even in the best hands. mistakes happen but business processes need to be in place to prevent catastrophes...

every time i see something like this, my inclination is to blame the CTO, not the engineer who pulled the trigger.

toomuchtodo 2455 days ago

A post mortem should always be a place to highlight deficiencies in processes and communicate necessary improvements put into place, not to blame. Blame should only occur if the cadence of outages becomes excessive. Complex systems are tricky, and to err is human.

Disclaimer: Ops/infra engineer in a previous life.

astrodust 2455 days ago

I wonder how many outages these days start with something like "kubectl apply" and then things go horribly awry.

solotronics 2455 days ago

We can blame whoever we want but you better believe shit rolls downhill at most places.

GhettoMaestro 2455 days ago

Until it is a big enough F-up that an executive's head must roll.

mdaniel 2454 days ago

There's a famous corollary to that approach: "Fire him? Why, I just spent 10 million dollars _educating_ him"

(regrettably I can't find any evidence it's a true (quote|story), but I enjoy the sentiment)

nonbirithm 2455 days ago

It could be DNS. Azure has had an all-region failure due to a single DNS provider outage. It's possible that the same DNS provider's outage was also causing problems for GCE and AWS at the same time.

https://news.ycombinator.com/item?id=19812919

dc352 2455 days ago

That wouldn't be at the top of my list. We have "Volumes" for databases and they were inaccessible for like 6 hours. I don't think any DNS is involved in mounting these. But hey, there's always a lot of crap hidden behind the scenes :)

markonen 2455 days ago

I would be absolutely amazed if DNS was not involved in mounting a block storage volume.

pmlnr 2455 days ago

A bad puppet/ansible/etc commit is the most probable explanation.

mdellavo 2455 days ago

dee ennn esss

hinkley 2455 days ago

A bug that has no obvious side effects that only became visible once all data centers were upgraded?

Happens. Statistics are hard.
