by colmmacc 3286 days ago
I work at AWS and I think there are definitely some similarities and differences. We do share our backbone with CloudFront, and hence our video traffic, of which there's quite a lot these days. We also advertise our network ranges broadly; it's our mission to carry the traffic as much as possible ourselves. So those aspects are very similar.

But a genuine difference is that we don't try to operate a global "seamless" network. The reason is that we optimize for the "A" in CAP. Our experience is that at the low-ish level of a network, it can be too easy for outages and availability issues to spread quickly. For example, with global networking then a misconfiguration or error can more easily propagate globally and bring everything down.

Instead, we have autonomous uncoupled regions and it's one of our core principles that faults and errors stay within these regions (or better yet, availability zones). That does mean that partitions can happen, but we find that most customers use active-standby configurations (where it makes no real difference) for key data, and we also build the tools that work with partitionable networks at a higher level. For example, Route 53 supports multi-region routing and failover, and does it measurably better than simple anycast routing can achieve.
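To make the active-standby idea concrete, here's a rough sketch of failover across independent regions: try regions in priority order and send traffic to the first one whose health check passes. This is not how Route 53 is implemented; the `Region` type, the `pick_region` function, and the region names are all made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Region:
    name: str      # hypothetical region label, not a real AWS region
    healthy: bool  # outcome of an assumed per-region health check


def pick_region(regions: List[Region]) -> Optional[str]:
    """Return the first healthy region in priority order, or None if all are down."""
    for region in regions:
        if region.healthy:
            return region.name
    return None


# The primary is down, so traffic fails over to the standby.
regions = [
    Region("primary-region", healthy=False),
    Region("standby-region", healthy=True),
]
print(pick_region(regions))  # prints "standby-region"
```

The point of the sketch is that each region's health is evaluated on its own, so a fault in one region only changes which entry gets picked rather than affecting the others.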

Over time, we're offering more and more multi-region services, such as cross-region replication for data, but the coordination is done at higher levels where we can achieve higher levels of availability in simpler ways, built on top of a more solid foundation.


Yeah, at least one outage in the last year for us included the paraphrased "anycast is hard". There's definitely an advantage to limited-blast-radius services, but I wouldn't trade GCS's multiregional buckets for S3 CRR (the same applies to Datastore, etc.). We have strict reliability requirements for global services; our perspective is that some customers want regional control for regulatory reasons, and we're happy to meet those requirements. But composing a global or even multi-regional service on an untrusted, lossy network is "crazy".

Again, Disclosure: I work on Google Cloud (and this network existed when I got here!)

>For example, with global networking then a misconfiguration or error can more easily propagate globally and bring everything down.

This sounds like Nassim Taleb's antifragile meme [1].

If I were running IT for some large enterprise (which I'm not!) then I might replicate services on both AWS and Google Cloud. A bit like how Apple tries to have more than one supplier for its hardware components.

[1] https://en.wikipedia.org/wiki/Antifragile

You don't always need to have enterprise $$$ to take an antifragile approach though.

I'm mirroring my private Git repositories between Gitlab.com and Bitbucket, which can be done for $0.

I might even end up paying for the bottom-end Github.com ($7 per month) and Gitlab ($4 per month) accounts and have three-way redundancy.
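For anyone curious, one way to do this kind of mirroring is to push the same local repository to each host with `git push --mirror`. The sketch below is only an assumption about how it could be scripted, not necessarily the setup described above; the `mirror` and `mirror_commands` helpers, the repository path, and the remote URLs are hypothetical placeholders.

```python
import subprocess
from typing import List


def mirror_commands(repo_dir: str, remote_urls: List[str]) -> List[List[str]]:
    """Build one `git push --mirror` command per remote URL."""
    # Note: --mirror force-updates refs on the remote and deletes remote refs
    # that no longer exist locally, so only point it at dedicated mirrors.
    return [["git", "-C", repo_dir, "push", "--mirror", url] for url in remote_urls]


def mirror(repo_dir: str, remote_urls: List[str], dry_run: bool = True) -> List[List[str]]:
    """Push repo_dir to every remote; with dry_run=True, only return the commands."""
    commands = mirror_commands(repo_dir, remote_urls)
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return commands


# Hypothetical remotes; replace with your own repository URLs.
remotes = [
    "git@gitlab.com:example-user/example-repo.git",
    "git@bitbucket.org:example-user/example-repo.git",
]
for cmd in mirror("/path/to/repo", remotes):
    print(" ".join(cmd))
```

Adding a third host is just one more entry in the `remotes` list, which is what makes the redundancy cheap.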