by phamilton
1156 days ago
We process around 50M Sidekiq jobs a day across a few hundred workers on a heavily autoscaled infrastructure. Over the past week there were 2 jobs that would have been lost if not for SuperFetch. It's not a ton, but it's not zero. And when it comes to data durability the difference between zero and not zero is usually all that matters.

Edit for additional color: One of the most common crashes we'll see is OutOfMemory. We run in a containerized environment, and if a rogue job uses too much memory (or a deploy drastically changes our memory footprint), the container will be killed. In that scenario, the job is not placed back into the queue. SuperFetch is able to recover such jobs, albeit with really loose guarantees around "when".
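To make the failure mode concrete, here is a rough in-memory sketch of the difference between a pop-and-forget fetch and a fetch that parks the job in a per-worker in-progress list until it is acknowledged. This is not Sidekiq's or SuperFetch's actual code; the `Broker` class, its method names, and the worker ID are all hypothetical, and a plain Python deque stands in for Redis.

```python
from collections import deque


class Broker:
    """In-memory stand-in for a Redis-backed queue (hypothetical, illustration only)."""

    def __init__(self, jobs):
        self.queue = deque(jobs)
        self.in_progress = {}  # worker_id -> jobs claimed but not yet acknowledged

    def basic_fetch(self):
        # Pop-and-forget: once popped, the broker keeps no record of the job.
        return self.queue.popleft() if self.queue else None

    def reliable_fetch(self, worker_id):
        # Move the job into a per-worker in-progress list instead of dropping it.
        if not self.queue:
            return None
        job = self.queue.popleft()
        self.in_progress.setdefault(worker_id, []).append(job)
        return job

    def ack(self, worker_id, job):
        # Called only after the job finishes successfully.
        self.in_progress[worker_id].remove(job)

    def recover(self, dead_worker_id):
        # Requeue everything a dead worker had claimed but never acknowledged.
        orphans = self.in_progress.pop(dead_worker_id, [])
        self.queue.extend(orphans)
        return orphans


# Basic fetch: the worker takes job-1 and is then killed (simulated by never
# finishing or requeueing it). Nothing remembers job-1.
basic = Broker(["job-1", "job-2"])
basic.basic_fetch()
left_after_basic_crash = list(basic.queue)

# Reliable fetch: same crash, but job-1 is still sitting in worker-a's
# in-progress list, so a later recovery pass can put it back on the queue.
reliable = Broker(["job-1", "job-2"])
reliable.reliable_fetch("worker-a")
recovered = reliable.recover("worker-a")
left_after_reliable_crash = list(reliable.queue)
```

In this toy version recovery is an explicit call; in a real system something has to notice the worker is dead first, which is where the loose guarantees around "when" come from.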
|
50,000,000 * 7 = 350,000,000
2 / 350,000,000 = 0.000000005714286
1 - (2 / 350,000,000) = 0.999999994285714 ≈ 99.9999994%
> It's not a ton, but it's not zero. And when it comes to data durability the difference between zero and not zero is usually all that matters.
If your system isn't resilient to 2 in 350,000,000 jobs failing, I think there is something wrong with your system.
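For anyone who wants to check the arithmetic, a quick sketch using the figures from the parent comment (50M jobs a day, 7 days, 2 jobs at risk). The variable names are just illustrative, and exact fractions are used so rounding only happens at print time.

```python
from fractions import Fraction

jobs_per_day = 50_000_000  # "around 50M" from the parent comment
days = 7                   # "over the past week"
at_risk = 2                # jobs that would have been lost

total = jobs_per_day * days
loss_rate = Fraction(at_risk, total)        # exact: 1/175,000,000
success_pct = float((1 - loss_rate) * 100)  # percentage of jobs not at risk

print(f"total jobs:   {total:,}")
print(f"loss rate:    {float(loss_rate):.3e}")
print(f"success rate: {success_pct:.7f}%")
```

That works out to roughly one at-risk job per 175 million processed.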