


	by mnm1 2721 days ago
	Workers and queues fail. SQS was down for us for almost two weeks while AWS fixed a bug. We had no choice but to wait or rewrite our implementation... again! We had already rewritten it once because of poor visibility and rare, intermittent problems processing data. Debugging such distributed systems is legendary hell. And that's just for simple async processing, so that we can return a response quickly to the user and finish the task in a few seconds. There is simply no comparison between such a complex, failure-prone distributed system and the simplicity, reliability, and ease of use of having support for this built into the language, IMO.

1 comment

StreamBright 2721 days ago

I am sorry, but I disagree. You are trying to make it sound as if your cloud provider's downtime has something to do with how you manage your workload in your code.

Debugging __any__ distributed system is difficult, which is why monitoring and tracing should be first-class citizens in your deployments. It seems they are not for you.


mnm1 2720 days ago

Yeah, monitoring told us it was down, and eventually we figured out it was an AWS issue we could do nothing about until they patched it. My main point is actually that for many use cases, this doesn't have to be a distributed computing problem, and thus the non-distributed version is superior to the distributed version.
