by chr15 3965 days ago
Is there any real way to design your application to handle S3 failures like this? S3's SLA has 99.99% availability, but is there a way to handle the 1% so your application is not affected? Options I can think of:

  1. Using a CDN to serve files can help in some cases
  2. On-prem systems may be able to use gateway-cached volumes and use the local disk cache vs S3
Other ideas?
8 comments

Slightly OT, but there's an interesting phenomenon at work now that so much of the internet depends on Amazon's infrastructure. When it goes down you might not even need to worry about it that much, as so many sites/apps will be broken that most users will just assume that the internet is broken.
It has happened a few times before. And no, the Internet was not broken, and not as many sites use AWS as you would assume. A lot of services still run on their datacenters or on-premise servers, or maintain a hot backup that they can switch to immediately.

>> When it goes down you might not even need to worry about it that much

I'm afraid this is pretty much getting the entire cloud thing wrong.

> A lot of services still run on their datacenters or on-premise servers, or maintain a hot backup that they can switch to immediately.

I'm not an idiot, I'm well aware of that. My point is that when a large number of consumer-facing sites go down, users (who aren't aware of Amazon cloud servers) simply assume something is wrong with the internet.

Obviously if you have a mission critical service this isn't acceptable. But for a lot of average sites/apps it might not be worth the investment in time/effort to cover relatively small outages such as these.

> I'm afraid this is pretty much getting the entire cloud thing wrong.

Not really. It's the utility of the cloud - if there's an outage there are already a lot of people working to fix it. If you're self-hosted, that's on you.

The first thing I do when Netflix goes down is complain on Facebook. Facebook up, Netflix down? It's not the Internet that's broken.
> I'm not an idiot

Nobody called you an idiot.

> I'm afraid this is pretty much getting the entire cloud thing wrong.

This was in reference to your comment "most users will assume the internet is broken". In the context of "it's OK because the customer thinks everything is down" the comment would be completely appropriate. It's being done wrong if this is the way we approach things.

Similarly, a lot of users will call tech support to complain that "the Internet is down" when it's merely their email provider, or Facebook, or League of Legends, and everything else is fine... :-)
Which works if you fail to read-only, and simply switch to reading from the non-failed region. More complicated is attempting to continue writing to a non-failed region and then attempting to bring everything consistent once the failed region is available again.

EDIT: If you were using hashing/uuids/guids for objects, this should be possible using a background task that'll scan various buckets in multiple regions and move objects when/where necessary to return to a consistent state.
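A minimal sketch of that idea, assuming content-hash keys so that copying an object between regions can never conflict. `RegionStore` is a hypothetical in-memory stand-in for one bucket per region, not a real S3 client, and all function names here are illustrative. Deletions and overwrites under mutable keys are not handled.

```python
import hashlib


class RegionDown(Exception):
    """Raised by a store that is currently unavailable (hypothetical)."""


class RegionStore:
    """In-memory stand-in for one bucket in one region."""

    def __init__(self, name):
        self.name = name
        self.objects = {}
        self.available = True

    def _check(self):
        if not self.available:
            raise RegionDown(self.name)

    def put(self, key, data):
        self._check()
        self.objects[key] = data

    def get(self, key):
        self._check()
        return self.objects[key]

    def keys(self):
        self._check()
        return set(self.objects)


def content_key(data):
    # Identical bytes always map to the same key, so a copy is idempotent.
    return hashlib.sha256(data).hexdigest()


def write_any(stores, data):
    """Write to every region that is up; fail only if none accepts it."""
    key = content_key(data)
    written = 0
    for store in stores:
        try:
            store.put(key, data)
            written += 1
        except RegionDown:
            continue
    if written == 0:
        raise RegionDown("all regions unavailable")
    return key


def read_any(stores, key):
    """Read from the first region that is up and has the object."""
    for store in stores:
        try:
            return store.get(key)
        except (RegionDown, KeyError):
            continue
    raise KeyError(key)


def reconcile(stores):
    """Background pass: copy objects missing from any reachable region."""
    live = []
    for store in stores:
        try:
            live.append((store, store.keys()))
        except RegionDown:
            continue
    all_keys = set().union(*(keys for _, keys in live))
    sources = [store for store, _ in live]
    copied = 0
    for store, keys in live:
        for key in all_keys - keys:
            store.put(key, read_any(sources, key))
            copied += 1
    return copied
```

Running `reconcile` periodically after the failed region returns would bring the buckets back to the same set of objects; a second pass with nothing missing copies nothing.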

Multi Cloud. Use S3 + Azure Cloud Files + Google Cloud Files
S3 SLA is actually for only three nines (99.9%) or 8.76 hours / year or 43.8 minutes / month of downtime: https://aws.amazon.com/s3/sla/

CloudFront offers the same availability. Many CDNs offer no more than three nines. Some claim 100%, but there will eventually be faults. Most do really well at avoiding recognized outages, but I nonetheless think they offer a 100% guarantee so that you always get credit for any downtime, rather than because they guarantee they are never down.

You can look at replicating files to multiple providers; the following shows what kind of uptimes you can expect from the big players: https://cloudharmony.com/status-1year-of-storage

If you can live with read-only states behind a CDN, here is a similar report: https://cloudharmony.com/status-1year-of-cdn

Understand that there isn't 'An S3 service'. There are multiple S3 services in multiple Regions within AWS, and they're all operated independently of each other (this goes for all other AWS services too) so that cascading failures/etc. don't occur between regions. So, use 2-3 different S3 regions, or some other multi-cloud solution...
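A rough back-of-the-envelope for what 2-3 regions could buy you, assuming each region fails fully independently of the others. That assumption is optimistic (shared software, DNS, or your own deploys can take out several regions at once), so treat the result as a best case. The function name is illustrative.

```python
def combined_unavailability(per_region_availability, regions):
    """Fraction of time all regions are down at once, assuming independence."""
    return (1 - per_region_availability) ** regions


# 0.999 is the three-nines figure; swap in your own number.
for n in (1, 2, 3):
    u = combined_unavailability(0.999, n)
    print(f"{n} region(s): all down {u:.1e} of the time")
```

Under that assumption, two three-nines regions are both down about one millionth of the time, which is why reads that can fall back to a second region are so much more resilient than a single bucket.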
How independently? Like independent implementations of the software? They just implement the same interface?
As in, independent installations/configurations of the same service...
Is that enough?
Second this, I would love to hear how companies handle S3 outages. Although one correction, it's 0.01% that you need to handle (if they deliver on their availability promise). That's less than an hour a year.
Replicate to a different data center, e.g. Azure.
I wonder, what's the yearly downtime of Amazon? If S3 were up 99.99% of the time then the remainder .001% is only 5.256 minutes per year. S3 is actually down more than that. But how much exactly? It's impossible to discuss mitigation strategies if we don't even know what the exact issue we're mitigating is!
Your math is off for four nines; 100% - 99.99% = 0.01%, which is just over 52 minutes per year.

Also fun: https://en.wikipedia.org/wiki/High_availability#Percentage_c...
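For anyone who wants to check the arithmetic, a small sketch that converts an availability percentage into allowed downtime. It assumes a 365-day year (8760 hours); the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_minutes(availability_percent, period_hours=HOURS_PER_YEAR):
    """Minutes of allowed downtime in a period at the given availability."""
    return (100 - availability_percent) / 100 * period_hours * 60


for pct in (99.9, 99.99, 99.999):
    per_year = downtime_minutes(pct)
    per_month = downtime_minutes(pct, HOURS_PER_YEAR / 12)
    print(f"{pct}%: {per_year:.2f} min/year, {per_month:.2f} min/month")
```

Three nines comes out to 525.6 minutes (8.76 hours) per year, four nines to 52.56 minutes, and the 5.256-minute figure quoted earlier in the thread is what five nines (99.999%) would allow.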

The way it works is that they are promising 99.99%. If it is lower you will be reimbursed for that month (if I remember correctly), but only when you complain to them about it.