by ReidZB 2371 days ago
I think it's still a reduction in risk overall. In the old model, they were vulnerable to S3 failing in one region, a thing that's happened many times. Now they've mitigated the S3-failure-in-one-region issue, at least mostly (though as you point out, how they do so is unknown), and in exchange they've picked up a dependency on Lambda@Edge. But Lambda@Edge, like CloudFront, is a global service distributed across many regions, and to my knowledge AWS has never had a global Lambda@Edge outage.

It's not impossible, of course. Some kind of control plane error could probably knock the whole global service offline. But I'd rather bet on a multi-region service than have all my eggs in one regional basket.

The most famous S3 outage was operator error by a well-meaning privileged user. The fact that it hasn’t happened for Lambda is just betting on luck. Shit happens; we can’t go designing ever more complicated solutions. Maybe our services should have some graceful degradation when shit happens instead of trying to create a big-bang and spawn an alternate universe.
> The fact that it hasn’t happened for Lambda is just betting on luck.

Cellular Architecture was largely a reaction to the S3 outage [0]. I agree that one is still bound to fail due to unknown unknowns or unpatchable known unknowns, but reducing the blast radius [1] to not be globally unavailable [2] is a step in the right direction.

[0] https://www.youtube-nocookie.com/embed/swQbA4zub20

[1] https://blog.acolyer.org/2016/09/12/on-designing-and-deployi...

[2] https://blog.acolyer.org/2015/05/07/large-scale-cluster-mana...
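
To make "reducing the blast radius" concrete, here is a minimal sketch of the general idea, not AWS's actual implementation: customers are pinned to independent cells, so a failure in one cell only touches the customers assigned to it. The names `cell_for` and `affected_customers` and the hash-based assignment are assumptions made up for illustration.

```python
import hashlib


def cell_for(customer_id: str, num_cells: int) -> int:
    """Deterministically pin a customer to one cell (hypothetical scheme)."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as an integer, then pick a cell.
    return int.from_bytes(digest[:8], "big") % num_cells


def affected_customers(customers, failed_cell: int, num_cells: int):
    """Return only the customers whose cell has failed."""
    return [c for c in customers if cell_for(c, num_cells) == failed_cell]


if __name__ == "__main__":
    customers = [f"customer-{i}" for i in range(1000)]
    hit = affected_customers(customers, failed_cell=0, num_cells=10)
    # With 10 cells, losing one cell affects only a fraction of customers.
    print(f"{len(hit)} of {len(customers)} customers affected")
```

The assignment has to be stable, otherwise a customer could hop between cells and the isolation is lost; that is why the sketch hashes the ID rather than picking a cell at random.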

Clever marketing term btw: what’s old is new.

‘Cellular architecture’ is how anyone not going down during their prior outages was doing it for over a decade, just not cleverly branded.

Good links, showing the base ideas being published half a decade ago. I’ve seen them in use for at least 15-20 years, pre-dating EC2 and AWS.

I mean, I agree in spirit, but everyone has a different sense of cost/complexity vs. return.

I don't advocate for ever-more-complicated solutions as a rule. e.g. I think multi-cloud setups are probably way more trouble than they're worth for most companies.

I certainly agree that graceful degradation where possible and not too expensive is ideal. For example, if S3 is having problems in one region, being able to fall back (gracefully degrade) into read-only mode might be a nice thing to have.
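
As a rough sketch of what falling back into read-only mode could look like (an illustration under assumptions, not the setup described in the article): reads try the primary region and then a replica, while writes only ever go to the primary and fail loudly when it is down. `InMemoryStore`, `ReadOnlyFallback`, and `RegionUnavailable` are hypothetical stand-ins for real regional bucket clients.

```python
class RegionUnavailable(Exception):
    """Raised by a store when its region cannot serve requests."""


class InMemoryStore:
    """Hypothetical stand-in for a bucket in a single region."""

    def __init__(self, name, objects=None):
        self.name = name
        self.objects = dict(objects or {})
        self.healthy = True

    def get(self, key):
        if not self.healthy:
            raise RegionUnavailable(self.name)
        return self.objects[key]

    def put(self, key, value):
        if not self.healthy:
            raise RegionUnavailable(self.name)
        self.objects[key] = value


class ReadOnlyFallback:
    """Serve reads from the replica when the primary region is down."""

    def __init__(self, primary, replica):
        self.primary = primary
        self.replica = replica

    def get(self, key):
        try:
            return self.primary.get(key)
        except RegionUnavailable:
            # Degrade gracefully: possibly stale data beats no data.
            return self.replica.get(key)

    def put(self, key, value):
        # Writes are never redirected to the replica, so the two copies
        # cannot diverge; callers see RegionUnavailable instead.
        self.primary.put(key, value)
```

The design choice here is that the replica is never written to directly, which keeps the degraded mode simple at the cost of rejecting writes during an outage.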

(In this particular case having a secondary region also probably helps with disaster recovery, which is pretty much mandatory in B2B, for better or worse.)

I completely agree. If building a read-only fallback required a lot of engineering and added a lot of complexity, I would also say it is overkill, but this solution doesn’t (happy to argue about that). It was an acceptable tradeoff for us, as we already replicate the underlying S3 buckets for disaster recovery, as you pointed out.

We also run our underlying Content Delivery APIs in two AWS regions so this was a logical extension.

Whether the added complexity is worth it for your use case is something only you can decide, and I hope the article provided some guidance around that rather than just being a copy & paste gist.

Source: I work at Contentful.