| HN Mirror



	by antpls 2773 days ago
	The status dashboard is inaccurate and/or a lie. It only mentions the GKE incident, while in fact the problem also impacts Google Compute Engine users. I was unable to create any google compute instance today, not even a basic 1vcpu, on NA and Europe-west. As another comment pointed out, what's the point of having so many zones and redundancy around the globe if such a global failure can still happen? I thought the "cloud" was supposed to make this kind of failure impossible


stevehawk 2773 days ago

This is unfortunately the norm. Like when AWS S3 went down (and AWS couldn't update its own status images because they're in S3, and we all laughed), and along with it went Alexa, Lambda, and every other service dependent on S3.


jorblumesea 2773 days ago

S3 is really one of the few services on AWS that can do that, unfortunately. It has no concept of zone/region; it's truly global. To me it seems like a serious design flaw, as everything else in AWS is striped by region, but I'm not sure why exactly it was built like that.

edit: nvm, S3 has regions; it's the bucket names that are global.


dodobirdlord 2773 days ago

Buckets are globally addressable because they planned for each S3 bucket + object key to have an associated URL (actually several), and URLs are a global namespace.

http(s)://<bucket>.s3.amazonaws.com/<object>

http(s)://s3.amazonaws.com/<bucket>/<object>
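
For illustration, here is a minimal sketch of how those two URL forms are put together from a bucket name and an object key. The function name `s3_urls` and the sample bucket/key are made up for this example; it only builds strings and does not talk to S3.

```python
from __future__ import annotations

from urllib.parse import quote


def s3_urls(bucket: str, key: str, scheme: str = "https") -> tuple[str, str]:
    """Return (virtual-hosted-style URL, path-style URL) for a bucket and key.

    Hypothetical helper: it mirrors the two URL shapes quoted above.
    """
    # Percent-encode the key but keep "/" so "folder-like" keys stay readable.
    encoded_key = quote(key, safe="/")
    virtual_hosted = f"{scheme}://{bucket}.s3.amazonaws.com/{encoded_key}"
    path_style = f"{scheme}://s3.amazonaws.com/{bucket}/{encoded_key}"
    return virtual_hosted, path_style


if __name__ == "__main__":
    for url in s3_urls("example-bucket", "photos/cat 1.jpg"):
        print(url)
```

Since the bucket name ends up in the hostname in the first form, two buckets with the same name would collide in DNS, which is one way to see why the names need to be globally unique.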


jorblumesea 2773 days ago

URLs would have to be global, but why the buckets themselves? It seems like a many-to-one relationship would easily be possible.


kevan 2773 days ago

Given the historical context (S3 launched 12 years ago, 5 months before EC2 launched with the us-east-1 region), it's reasonable that S3 buckets were global because regions didn't really exist yet as a concept.

If you look at the docs now[1], new buckets are regionalized and the region is in the URL for non-us-east-1 regions.

[1] https://docs.aws.amazon.com/general/latest/gr/rande.html#s3_...
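
A rough sketch of what "the region is in the URL" looks like in practice. The helper name `s3_regional_url` is hypothetical, and the dotted `s3.<region>` host form used here is an assumption for illustration; the linked docs are the authority on the exact endpoint names for each region.

```python
def s3_regional_url(bucket: str, key: str, region: str = "us-east-1") -> str:
    """Build a virtual-hosted-style URL, adding the region outside us-east-1.

    Assumption: regional hosts look like <bucket>.s3.<region>.amazonaws.com.
    """
    if region == "us-east-1":
        # The original region keeps the region-less hostname.
        host = f"{bucket}.s3.amazonaws.com"
    else:
        host = f"{bucket}.s3.{region}.amazonaws.com"
    return f"https://{host}/{key}"


if __name__ == "__main__":
    print(s3_regional_url("example-bucket", "index.html"))
    print(s3_regional_url("example-bucket", "index.html", region="eu-west-1"))
```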


jorblumesea 2773 days ago

Got it, makes sense. So the only way you'd have a global bucket is if you created it before a certain date (whenever they were regional-ized, I assume). Thanks!


carbocation 2773 days ago

> I was unable to create any google compute instance today, not even a basic 1vcpu, on NA and Europe-west.

I've been creating GCP instances in us-central1-a and us-central1-c today without issue. Which zone were you using in NA?

I have been noticing unusual restarts, but I haven't been able to pin down the cause yet (may be my software and not GCP itself).


antpls 2773 days ago

Tried on us-east, us-north, europe-west, also tried asia, with different instance sizes and with both UI and CLI. None worked for me.


pfd1986 2773 days ago

Same here.


aviv 2773 days ago

Have not seen any restarts this weekend, and we have several hundred instances on GCE.


carbocation 2773 days ago

Thanks! I'm running Skylake 96-core instances, but I haven't gotten around to trying the 64-core instances for comparison yet. If I get another restart, I'll do a 96 vs 64 comparison to try to narrow down the cause. Most likely, of course, this is a software issue on my end, not Google's.


0xbadcafebee 2773 days ago

> I thought the "cloud" was supposed to make this kind of failure impossible

You have to remember that you're trying to have access to backend platforms and infrastructure at all times, which almost no public utility does (assuming "the cloud" is "public utility computing"). Power plants go into partial shutdown, water treatment plants stop processing, etc. Utilities are only designed to provide constant reliability for the last mile.

If there's a problem with your power company, they can redirect power from another part of the grid to service customers. But some part of your power company is just... down. Luckily you have no need to operate on all parts of the grid at all times, so you don't notice it's down. But failure will still happen.

Your main concern should be the reliability of the last mile. Getting away from managing infrastructure yourself is the first step in that equation. AppEngine and FaaS should be the only computing resources you use, and only object storage and databases for managing data. This will get you closer to public utility-like computing.

But there's no way to get truly reliable computing today. We would all need to use edge computing, and that means leaning heavily on ISPs and content provider networks. Every cloud computing provider is looking into this right now, but considering who actually owns the last mile, I don't think we're going to see edge computing "take over" for at least a decade.


aiisjustanif 2773 days ago

> I thought the "cloud" was supposed to make this kind of failure impossible

If it's set up and used correctly, yeah. But it's not a perfect world.


davemp 2773 days ago

I’ll suggest considering whether entities enamored with centralizing ideals are more likely to fail to properly realize the robustness of a distributed system.


aviv 2773 days ago

We have created GCE instances in several US regions without any issue today. Last one was 10 minutes ago in west2.
