

by phamilton 1203 days ago

We process around 50M sidekiq jobs a day across a few hundred workers on a heavily autoscaled infrastructure.

Over the past week there were 2 jobs that would have been lost if not for superfetch.

It's not a ton, but it's not zero. And when it comes to data durability the difference between zero and not zero is usually all that matters.

Edit for additional color: One of the most common crashes we'll see is OutOfMemory. We run in a containerized environment, and if a rogue job uses too much memory (or a deploy drastically changes our memory footprint) the container will be killed. In that scenario, the job is not placed back into the queue. SuperFetch is able to recover them, albeit with really loose guarantees around "when".
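To make the failure mode concrete, here is a minimal in-memory sketch. It is not Sidekiq's or SuperFetch's actual code; the function names (`basic_fetch`, `reliable_fetch`, `recover_orphans`) and the per-worker "working" list are illustrative assumptions modeling one common pattern for queues built on a list store.

```python
from collections import deque


def basic_fetch(queue):
    """Pop the job off the shared queue; it now lives only in worker memory."""
    return queue.popleft()


def reliable_fetch(queue, working):
    """Move the job to a per-worker 'working' list before processing it.

    In a real store this move would need to be atomic; here it is two steps.
    """
    job = queue.popleft()
    working.append(job)
    return job


def recover_orphans(queue, working):
    """Requeue jobs left behind by a worker that died mid-job."""
    while working:
        queue.append(working.popleft())


# Plain pop: the container is killed after the fetch, and the job is gone.
queue = deque(["job-a"])
basic_fetch(queue)
print(list(queue))  # the shared queue no longer holds the job

# Working list: the job survives the crash and can be requeued later.
queue, working = deque(["job-b"]), deque()
reliable_fetch(queue, working)
recover_orphans(queue, working)  # run by some other process, at some later time
print(list(queue))
```

The recovery step runs whenever something notices the dead worker, which is why the guarantee about "when" is loose.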

1 comment

ZephyrBlu 1203 days ago

Let me get this straight, you're complaining about eight 9s of reliability?

50,000,000 * 7 = 350,000,000

2 / 350,000,000 = 0.000000005714286

1 - (2 / 350,000,000) = 0.999999994285714 = 99.999999%
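The arithmetic above can be checked with a short sketch. The helper name `count_nines` is made up for illustration; it counts the leading 9s in the success rate by taking the negative base-10 logarithm of the loss rate.

```python
import math


def count_nines(lost, total):
    """Return how many leading 9s the success rate (1 - lost/total) has."""
    loss_rate = lost / total
    return math.floor(-math.log10(loss_rate))


jobs_per_day = 50_000_000
days = 7
total_jobs = jobs_per_day * days  # 350,000,000 jobs over the week

print(total_jobs)
print(2 / total_jobs)              # loss rate, roughly 5.7e-9
print(count_nines(2, total_jobs))  # 8
```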

> It's not a ton, but it's not zero. And when it comes to data durability the difference between zero and not zero is usually all that matters.

If your system isn't resilient to 2 in 350,000,000 jobs failing I think there is something wrong with your system.


phamilton 1203 days ago

This isn't about 2 in 350,000,000 jobs failing. It's about 2 jobs disappearing entirely.

It's not reliability we're talking about, it's about durability. For reference, S3 has eleven 9s of durability.

Every major queuing system solves this problem. RabbitMQ uses unacknowledged messages which are pinned to a TCP connection, so when that connection drops before acknowledging them they get picked up by another worker. SQS uses visibility timeouts, where if the message hasn't been successfully processed within a time frame it's made available to other workers. Sidekiq free edition chooses not to solve it. And that's a fine stance for a free product, just one I wish were made clearer.
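As a rough illustration of the visibility-timeout idea, here is a toy in-memory queue. It is a sketch of the concept only, not the SQS API: the class `VisibilityQueue`, its method names, and the explicit `now` argument (used instead of a real clock so the behavior is deterministic) are all assumptions made for this example.

```python
from collections import deque


class VisibilityQueue:
    """Toy queue where a received message stays hidden until a deadline passes."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.ready = deque()
        self.in_flight = {}  # message -> time at which it becomes visible again

    def send(self, message):
        self.ready.append(message)

    def receive(self, now):
        # Messages whose deadline has passed were never acknowledged,
        # so they become available to other workers again.
        for message, deadline in list(self.in_flight.items()):
            if deadline <= now:
                del self.in_flight[message]
                self.ready.append(message)
        if not self.ready:
            return None
        message = self.ready.popleft()
        self.in_flight[message] = now + self.visibility_timeout
        return message

    def ack(self, message):
        # Successful processing removes the message for good.
        self.in_flight.pop(message, None)


q = VisibilityQueue(visibility_timeout=30)
q.send("job-1")
first = q.receive(now=0)    # worker A takes the job, then crashes without ack
hidden = q.receive(now=10)  # still inside the timeout: nothing to hand out
again = q.receive(now=31)   # timeout elapsed: worker B receives the same job
q.ack(again)
```

The key property is that a message is only deleted on an explicit acknowledgment, so a worker dying mid-job delays the message rather than losing it.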


ZephyrBlu 1203 days ago

If you want to focus on durability then I think your complaint makes even less sense. Somehow I doubt S3 is primarily backed by Redis.

I think it's fair to assume that something backed by Redis is not durable by default because that's not what Redis is known for, whereas the other options you listed are known for their resiliency and durability. I wouldn't view Sidekiq as a similar product to RabbitMQ and SQS.

Also, Sidekiq Pro uses more advanced Redis features to enable super_fetch, which lends support to the assumption that Redis is not durable by default: https://www.bigbinary.com/blog/increase-reliability-of-backg....
