| HN Mirror



	by mnutt 215 days ago
	> we never had any issues, because we didn't depend on calling AWS APIs to continue operating. Things already running continue to run.

	I think it was just luck of the draw that the failure happened in this way and not some other way. Even if APIs falling over while EC2 instances remain up is a slightly more likely failure mode, it means you can't run autoscaling, can't depend on spot instances which in an outage you can lose and can't replace.

1 comments

0xbadcafebee 214 days ago

> it means you can't run autoscaling, can't depend on spot instances which in an outage you can lose and can't replace

Yes, this is part of designing for reliability. If you use spot or autoscaling, you can't assume you will have high availability in those components. They're optimizations, like a cache. A cache can disappear, and this can have a destabilizing effect on your architecture if you don't plan for it.

This lack of planning is pretty common, unfortunately. Whether it's in a software component or system architecture, people often use a thing without understanding its implications. Then when AWS API calls become unavailable, half the internet falls over... because nobody planned for "what happens when the control plane disappears". (This is actually a critical safety consideration in other systems.)
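
To make the cache analogy concrete, here is a minimal sketch of the check it implies: does the fleet still cover peak load if every spot instance disappears and cannot be replaced? The `Fleet` type, the field names, and the numbers are hypothetical, made up for illustration; nothing here calls AWS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fleet:
    """Hypothetical fleet description (all names are illustrative)."""
    on_demand: int            # instances that keep running without control-plane calls
    spot: int                 # instances that may be reclaimed and not replaced in an outage
    per_instance_rps: float   # requests/second one instance can serve


def capacity_report(fleet: Fleet, peak_rps: float) -> dict:
    """Compare baseline (on-demand only) and total capacity against peak load."""
    baseline = fleet.on_demand * fleet.per_instance_rps
    total = baseline + fleet.spot * fleet.per_instance_rps
    return {
        "baseline_rps": baseline,
        "total_rps": total,
        # True only if losing ALL spot capacity still leaves enough headroom,
        # i.e. spot is a pure optimization, like a cache that may vanish.
        "survives_spot_loss": baseline >= peak_rps,
    }


if __name__ == "__main__":
    # Assumed example numbers, not taken from the thread.
    print(capacity_report(Fleet(on_demand=10, spot=20, per_instance_rps=100.0), peak_rps=2500.0))
```

If `survives_spot_loss` is false, spot is load-bearing rather than an optimization, which is the situation the comment above warns about.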

mnutt 214 days ago

Sure, you can use only EC2, skip autoscaling and spot, provision for your peak capacity needs instead, and avoid any other AWS service that depends on DynamoDB.

We still take some steps to mitigate control plane issues in what I consider a reasonable AWS setup (we attempt to lock ASGs to prevent scale-down), but I place the control plane disappearing on the same level as the entire region going dark, and just run multi-region.
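
The comment does not say how the ASGs get locked. One possible reading, sketched below, is to suspend the `Terminate` and `ReplaceUnhealthy` scaling processes through the Auto Scaling API's `suspend_processes` call. The function name `lock_scale_in` and the choice of processes are assumptions. Note that the call itself goes through the control plane, so it can fail during the very outage it is meant to guard against, which is presumably why it is only an "attempt".

```python
import logging

log = logging.getLogger("asg-lock")

# Assumption: suspending these two processes is what "locking" means here.
# Both are valid Auto Scaling process names.
PROCESSES_TO_SUSPEND = ["Terminate", "ReplaceUnhealthy"]


def lock_scale_in(client, group_name: str) -> bool:
    """Best-effort attempt to stop an ASG from terminating instances.

    `client` is expected to behave like boto3.client("autoscaling").
    Returns True if the request was accepted, False if the call failed
    (for example because the control plane is unreachable).
    """
    try:
        client.suspend_processes(
            AutoScalingGroupName=group_name,
            ScalingProcesses=PROCESSES_TO_SUSPEND,
        )
    except Exception as exc:  # broad on purpose: any failure means "not locked"
        log.warning("could not lock %s: %s", group_name, exc)
        return False
    return True
```

With boto3 installed, usage would look like `lock_scale_in(boto3.client("autoscaling"), "web-asg")`, where `web-asg` is a placeholder group name. Since the lock can fail, it only complements the multi-region setup described above.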