| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by morelisp 1135 days ago
	My gut feeling is that if your only AZ goes down (or all your AZs simultaneously), you're going to lose data period because your producers are now all stuck, your APIs are unavailable, etc. Whether the data loss begins at the exact moment power failed or a couple minutes before doesn't matter, vs. the additional cost to fsync constantly. I mean it's good to know all the failure modes, but at the end of the day it's also good to know how much handling them will cost, and it's often not worth it.

2 comments

frant-hartm 1135 days ago

This is very practical way of looking at the problem and is true for majority of systems, but anyone serious enough about keeping their data, and not just pretending, has some kind of back pressure mechanism built in, so the messages will stop flowing if they can't be processed.

link

morelisp 1135 days ago

Right, and best case that’s going to come back as 503s or 429s, and if that continues for any length of time your customers are going to view it as morally equivalent to data loss (or maybe worse, if the response has no reason for them to be tied to some event stream).

link

dilyevsky 1135 days ago

producers stuck != data loss (if you use transactional commits at least). If you run in multiple regions you dont need multi az in a lot of architectures

link

morelisp 1135 days ago

I don't mean because of some misfeature in the Kafka protocol, I mean because events are still coming in but have nowhere to go. Unless you built a spill as wide as your Kafka cluster. Which isn't worth it, so no one does it.

link