

	by truthseeker11 2567 days ago
	The outage lasted two days for our domain (edu, sw region). I understand that they are reporting a single day with 3-4 hours of serious issues, but that’s not what we experienced. Great write-up otherwise; glad they are sharing openly.

2 comments

jacques_chester 2567 days ago

Outages like these don't really resolve instantly.

Any given production system that works will have capacity needed for normal demand, plus some safety margin. Unused capacity is expensive, so you won't see a very high safety margin. And, in fact, as you pool more and more workloads, it becomes possible to run with smaller safety margins without running into shortages.

These systems will have some capacity to onboard new workloads, let us call it X. They have the sum of all onboarded workloads, let us call that Y. Then there is the demand for the services of Y, call that Z.

As you may imagine, Y is bigger than X, by a lot. And when Y falls, the capacity to handle Z falls with it.

So in a disaster recovery scenario, you start with:

* the same demand, possibly increased from retry logic & people mashing F5, of Z

* zero available capacity, Y, and

* only X capacity-increase-throughput.

As it recovers you get thundering herds, slow warmups, systems struggling to find each other and become correctly configured etc etc.

Show me a system that can "instantly" recover from an outage of this magnitude and I will show you a system that's squandering gigabucks and gigawatts on idle capacity.
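
To make the X/Y/Z argument above concrete, here is a minimal sketch of a recovery after a total outage. It is a toy model, not a description of Google's systems: the function name, the fixed per-step restore rate, the constant demand, and the assumption that unserved requests queue up as backlog are all illustrative choices.

```python
def simulate_recovery(y_full, x_rate, z_demand, max_steps=10_000):
    """Return the step at which capacity is fully restored AND the backlog is empty.

    y_full   -- steady-state capacity (Y) before the outage
    x_rate   -- capacity that can be brought back per step (X)
    z_demand -- new demand arriving per step (Z)
    Returns None if the backlog never clears within max_steps.
    All names and the linear ramp are assumptions made for illustration.
    """
    capacity = 0   # the outage wiped out all serving capacity
    backlog = 0    # requests that arrived but could not be served yet
    for step in range(1, max_steps + 1):
        # Capacity comes back only as fast as workloads can be re-onboarded.
        capacity = min(y_full, capacity + x_rate)
        pending = backlog + z_demand
        served = min(capacity, pending)
        backlog = pending - served
        if capacity == y_full and backlog == 0:
            return step
    return None


if __name__ == "__main__":
    print(simulate_recovery(y_full=100, x_rate=10, z_demand=80))
```

With Y = 100, X = 10 and Z = 80, capacity is back after 10 steps, but the backlog built up during the ramp is not cleared until step 23. A smaller safety margin (Z closer to Y) stretches that tail further, and with no margin at all the backlog never drains in this model.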


truthseeker11 2567 days ago

Unless I’m misunderstanding the Google blog post, they are reporting ~4+ hours of serious issues. We experienced about two days.

If it had been possible to fix this sooner, I’m sure they would have done so. That’s not the point of my comment, though.


jacques_chester 2567 days ago

The root cause apparently lasted for ~4.5 hours, but residual effects were observed for days:

> From Sunday 2 June, 2019 12:00 until Tuesday 4 June, 2019 11:30, 50% of service configuration push workflows failed ... Since Tuesday 4 June, 2019 11:30, service configuration pushes have been successful, but may take up to one hour to take effect. As a result, requests to new Endpoints services may return 500 errors for up to 1 hour after the configuration push. We expect to return to the expected sub-minute configuration propagation by Friday 7 June 2019.

Though they report most systems returning to normal by ~17:00 PT, I expect that there will still be residual noise and that a lot of customers will have their own local recovery issues.

Edit: I probably sound dismissive, which is not fair of me. I would definitely ask Google to investigate and ideally give you credits to cover the full span of impact on your systems, not just the core outage.


truthseeker11 2567 days ago

That’s OK, I didn’t think your comment was dismissive. Those facts are buried in the report. Their opening sentence makes the incident sound less serious than it really was.


tweenagedream 2567 days ago

What does your stack look like?

It's hard to tailor a postmortem like this to everyone's individual experience, but it is surprising to me that your experience is so different.


truthseeker11 2567 days ago

I know what you meant; however, reports should not be tailored to individual experience. The facts should be reported clearly. I’m happy they are open about the whole incident. The reported ~4 hours was more like two days for us.

Our stack? Multiple OC WAN links, 10G LAN with 1 Gbps clients. 4,000+ users, EDU. We are super happy using Google. No complaints! Google is doing great.
