| HN Mirror


	by qaq 2987 days ago
	"but we have lost machines, connectivity, seen kernel panics, EBS failures, SSD failures, etc., your usual day in AWS" <=== This. I wish more people realized that this is the day-to-day reality if you are on AWS at scale.

3 comments

Joeri 2985 days ago

The best way I've heard it described is "complex systems run in degraded mode".

https://cdn.chrisshort.net/How-Complex-Systems-Fail.pdf

Basically, once a system is complex enough, some part of it is always broken. The software must be designed on the assumption that the system is never running flawlessly.
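
To make "designed on the assumption that something is broken" concrete, here is a minimal sketch of one common tactic: retrying transient failures with capped exponential backoff and jitter. The names `TransientError` and `call_with_retries` are hypothetical, made up for illustration; this is not taken from the linked paper or from any particular library.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, dropped connections)."""


def call_with_retries(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep, rng=random.random):
    """Run operation(), retrying transient failures with capped exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Full jitter: sleep a random fraction of the capped exponential delay,
            # so many clients retrying at once do not hammer the service in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * rng())
```

The `sleep` and `rng` parameters are only there so the behaviour can be tested without real waiting; in normal use you would call `call_with_retries(some_function)` and leave the defaults alone.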

qaq 2979 days ago

No doubt, but that's pretty high overhead for many projects. Colo is actually a decent choice, but I guess that's not a popular opinion.

koide 2986 days ago

As I understand it, it's like that everywhere at scale, not just on AWS; it's a property of operating at scale.

Or are you saying that AWS is particularly unreliable at scale?

panghy 2986 days ago

I tend to think that cloud providers are particularly opaque about small glitches (i.e. they aren't going to tell you that a router or switch was rebooted for maintenance if it comes back right away). You can email support, and it's always the same response: "it's working right now" :)

qaq 2986 days ago

On the network side, no; it's much worse on AWS.

koide 2986 days ago

Which provider is the best, network-wise?

qaq 2986 days ago

I only have experience with AWS, on-prem, and high-quality colo like Equinix. I've seen significantly fewer issues with the latter two than with AWS, possibly due to reduced complexity and having full control over the networking setup.

killertypo 2987 days ago

And FoundationDB has held our data durable through all of this.

qaq 2987 days ago

Sounds like whoever bolts a PG-compatible SQL layer on top will have a killer product on their hands :)

socceroos 2987 days ago

Have a look at CockroachDB

qaq 2987 days ago

Already playing with it, but FoundationDB is used for production petabyte-scale deployments, and the whole deterministic simulation approach to testing is really reassuring as far as bugs and stability go. I am guessing that with Apple's resources that approach was taken to a whole new level after the acquisition?
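
For anyone unfamiliar with the idea, here is a toy sketch of what deterministic simulation testing means in general: every source of randomness, including injected faults, comes from one seeded generator, so any failing run can be replayed exactly from its seed. This is only an illustration under my own assumptions (the `simulate` function, the dropped-write fault model, and the invariant are all made up) and is not FoundationDB's actual simulation framework.

```python
import random


def simulate(seed, steps=20, drop_rate=0.3):
    """Run a toy 'client writes to a flaky store' scenario driven only by `seed`."""
    rng = random.Random(seed)  # the sole source of randomness in the run
    store, acked, trace = {}, {}, []
    for step in range(steps):
        key, value = f"k{rng.randrange(5)}", step
        if rng.random() < drop_rate:
            # Injected fault: the write is lost before reaching the store.
            trace.append(("dropped", key, value))
            continue
        store[key] = value
        acked[key] = value  # the client saw an acknowledgement
        trace.append(("acked", key, value))
    # Invariant checked on every run: acknowledged writes are present in the store.
    assert all(store.get(k) == v for k, v in acked.items())
    return trace


# Replaying a seed reproduces the exact same sequence of events and faults.
assert simulate(42) == simulate(42)
```

In a real system the point is to sweep a huge number of seeds and, when an invariant breaks, rerun that one seed under a debugger.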
