by qaq 2987 days ago
"but we have lost machines, connectivity, seen kernel panics, EBS failures, SSD failures, etc., your usual day in AWS" <=== This. I wish more people realized that this is day-to-day reality if you are on AWS at scale.
3 comments

The best way I've heard it described is "complex systems run in degraded mode".

https://cdn.chrisshort.net/How-Complex-Systems-Fail.pdf

Basically, once a system is complex enough, some part of it is always broken. The software must be designed on the assumption that the system is never running flawlessly.
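To make "assume something is always broken" concrete, here is a minimal sketch of one common pattern: retrying a transient failure with capped exponential backoff and jitter. The names `TransientError` and `call_with_retries` are hypothetical, made up for illustration, and a real system would also need timeouts and idempotent operations.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, dropped connections)."""


def call_with_retries(op, attempts=5, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Run op(), retrying on TransientError with capped exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return op()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Backoff ceiling doubles each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random fraction of the ceiling so that
            # many clients do not retry in lockstep.
            sleep(rng() * ceiling)
```

The `sleep` and `rng` parameters are injectable only so the function can be exercised in tests without real waiting.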

No doubt, but that's pretty high overhead for many projects. Colo is actually a decent choice, but I guess that's not a popular opinion.
As I understand it, it's like that everywhere at scale, not just on AWS; it's a property of operating at scale.

Or are you saying that AWS is particularly unreliable at scale?

It seems to me that cloud providers are particularly opaque about small glitches (i.e. they aren't going to tell you that a router or switch was rebooted for maintenance if it comes back right away, and if you email support it's always the same response: "it's working right now") :)
On the network side no, it's much more crappy on AWS.
Which provider is the best, network wise?
I only have experience with AWS, on-prem, and high-quality colo like Equinix. I've seen significantly fewer issues there than on AWS, possibly due to reduced complexity and having full control over the networking setup.
And FoundationDB has held our data durable through all of this.
Sounds like whoever bolts a PG-compatible SQL layer on top will have a killer product on their hands :)
Have a look at CockroachDB
Already playing with it, but FoundationDB is used for production petabyte-scale deployments, and the whole deterministic simulation approach to testing is really reassuring as far as bugs and stability go. I am guessing that with Apple's resources that approach was taken to a whole new level after the acquisition?
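For anyone unfamiliar with the idea, here is a toy sketch of deterministic simulation testing. It is not FoundationDB's actual simulator; the three-replica model and the `simulate` function are assumptions made up for illustration. The point is that all randomness, including injected crashes, flows from a single seed, so any failing run can be replayed exactly.

```python
import random


def simulate(seed, steps=200, fault_rate=0.2):
    """Drive a toy 3-replica register under seeded random faults; return the event trace."""
    rng = random.Random(seed)  # the only source of randomness in the run
    up = [True, True, True]    # which replicas are currently alive
    value = [0, 0, 0]          # value stored on each replica
    committed = 0              # latest value acknowledged by a majority
    trace = []
    for step in range(steps):
        if rng.random() < fault_rate:
            # Injected fault: flip one replica between crashed and running.
            node = rng.randrange(3)
            up[node] = not up[node]
            if up[node]:
                value[node] = committed  # a recovering replica catches up
            trace.append((step, "recover" if up[node] else "crash", node))
        else:
            # Write attempt: commits only if a majority of replicas is up.
            live = [n for n in range(3) if up[n]]
            if len(live) >= 2:
                committed += 1
                for n in live:
                    value[n] = committed
                trace.append((step, "commit", committed))
            else:
                trace.append((step, "reject", committed))
        # Invariant checked after every step: live replicas hold the committed value.
        assert all(value[n] == committed for n in range(3) if up[n]), (seed, step)
    return trace
```

Running `simulate(seed)` over many seeds explores many fault schedules. If the invariant ever fails, the assertion reports the seed and step, and rerunning with that seed reproduces the same sequence of events.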