by mattbillenstein 2808 days ago
AppEngine had the same problems - seemingly every week some component of the service would be down for some non-negligible amount of time (laughably it was often search -- we're talking about Google here).

I've generally found AWS more reliable than GCP - even when GCP isn't having downtime, you'll occasionally get 503's from their APIs, so you need to wrap all your calls to them in retries.

AWS has had multiple instances of cascading EBS backplane failures, but outside of that I've found their core services pretty reliable -- 400+ days of uptime on a lot of VMs in systems I've worked on -- I avoid EBS when I can.

My advice is to keep your stuff simple - PaaS might seem attractive, but as you mention, you have so little control when something goes down. Embrace multi-cloud by using the lowest common denominator of tech available - virtual machines, DNS, networking, and instance storage if that suits your needs. Treat VMs as disposable - and make sure you have enough system, service, and data redundancy at that level for your application to survive the failure of an entire availability zone.


AppEngine had some big failures early on, but I (and some friends) built a $$$$$$$$$$$$ company on AppEngine (and GCP) and couldn't have done it without it. The stability the last few years has been extremely good. Our base logic was that we trust Google to hire and train talented DevOps people better than we could ourselves, and it sure sucks carrying a pager.
Snapchat?

If your app is that big, someone is always carrying a pager for when there are problems. The difference is on PaaS, you can't do a damn thing about it if it's a problem with the platform.

I've helped multiple companies get off of AppEngine because even for companies losing money (startups), it's too unreliable -- and actually very slow (datastore) if your app is relational. Also, it's very, very expensive if you hit the datastore hard.

Not quite, but first MVP in 3 months and $80m gross revenue in the first year. Selling t-shirts. We did it with 3 engineers, no devops or qa teams and definitely no pagers. We had zero downtime and the very rare bugs were fixed on the next push to master (CI/CD) and real testing.

I'm not saying the datastore is perfect, but using the datastore has well known and predictable limitations that need to be engineered for. It is definitely not something you can RTFM later on. Just like any database to be honest. It is not a relational database. It doesn't do aggregations. It is for storing data (using Objectify [0]) and memcache is for caching that data.

[0] https://github.com/objectify/objectify

>you'll occasionally get 503's from their APIs, so you need to wrap all your calls to them in retries.

No matter which cloud platform you're using you should do this[1]. I'm not familiar with the GCP SDK but I know the AWS SDK has it built in[2]. If you're not using the SDK then you have to build it yourself. There will always be a small percentage of transient errors due to the network, DNS, timeouts, hardware failure, etc.

[1] This is a blanket generalization, there are some situations where you shouldn't use the backoff/retry pattern even for retryable errors.

[2] https://docs.aws.amazon.com/general/latest/gr/api-retries.ht...
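To make the backoff/retry pattern above concrete, here is a minimal sketch in Python. It is not the AWS SDK's implementation or anyone's production code: `TransientError` and `call_with_retries` are hypothetical names, and the attempt count and delay values are assumptions picked for illustration. It retries only errors marked as transient, and sleeps for a random time up to an exponentially growing cap between attempts.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure such as an HTTP 503."""


def call_with_retries(func, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep):
    """Call func(), retrying on TransientError with exponential backoff.

    All parameter defaults are illustrative assumptions, not recommendations.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except TransientError:
            # Out of attempts: let the caller see the last error.
            if attempt == max_attempts - 1:
                raise
            # Exponential cap with random jitter, so many clients that
            # failed together do not all retry at the same instant.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

Any other exception propagates immediately, which matches the caveat in [1]: only errors known to be retryable should go through the loop, and only when the call is safe to repeat.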

Pretty amazing how virtualization has come to the point that we even need to virtualize our reliance on cloud vendors across multiple vendors to ensure reliability.
I'll say, I've only really done multi-cloud for cases where I liked different products on different clouds -- the main app and data on AWS, but using Google's data stack (BigQuery >> Redshift imho).

In terms of reliability, I think the first step is multi-region -- being able to failover to another region should your primary region have major failage. But assuming you can do that, doing multi-cloud for the same thing shouldn't be so hard provided you have some sort of common open source runtime to run on both platforms.
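The failover idea above can be sketched roughly as follows. This is an illustration under stated assumptions, not a real cloud API: `RegionDown` and `call_with_failover` are hypothetical names, and each "region" is just a callable standing in for the same runtime deployed in a different region or on a different cloud.

```python
class RegionDown(Exception):
    """Hypothetical error meaning a whole region cannot serve the request."""


def call_with_failover(regions, request):
    """Try each (name, handler) pair in priority order.

    Returns (region_name, result) from the first region that answers.
    Raises RegionDown if every region fails.
    """
    failed = []
    for name, handler in regions:
        try:
            return name, handler(request)
        except RegionDown:
            # Remember which region failed and move on to the next one.
            failed.append(name)
    raise RegionDown("all regions failed: " + ", ".join(failed))
```

In practice the switch usually happens at the DNS or load balancer layer rather than in application code, and the hard part is keeping the data replicated so the secondary region has something to serve.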

Appreciate the thoughtful reply and summary of experiences with GCP vs AWS.
Retrying localizable errors (such as 503 Service Unavailable) should be universally practiced in any RPC scheme. Nobody can make a backend that's 100% reliable.
AWS suffers from the same status page gaslighting though.