by verdverm 1384 days ago
This is exactly the problem MapReduce solved. Once you reach a large enough scale (which isn't all that big), some jobs are guaranteed to fail, and you need a way to recover from them.
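
To put rough numbers on that, here is a minimal sketch. It assumes tasks fail independently with the same probability, which is a simplification, and the 0.1% failure rate is a made-up figure for illustration. The `run_with_retries` helper is hypothetical and only shows the general idea of re-executing the tasks that failed; it is not how MapReduce itself is implemented.

```python
def p_any_failure(n_tasks: int, p_fail: float) -> float:
    """Probability that at least one of n independent tasks fails."""
    return 1.0 - (1.0 - p_fail) ** n_tasks


def run_with_retries(tasks, run_task, max_attempts=3):
    """Run every task, then re-run only the ones that raised.

    Returns (results, still_failed). `run_task` is any callable that
    takes a task and either returns a value or raises an exception.
    """
    results = {}
    pending = list(tasks)
    for _ in range(max_attempts):
        failed = []
        for task in pending:
            try:
                results[task] = run_task(task)
            except Exception:
                # Keep the task so the next pass can try it again.
                failed.append(task)
        pending = failed
        if not pending:
            break
    return results, pending


if __name__ == "__main__":
    # Hypothetical per-task failure rate of 0.1%.
    for n in (10, 1_000, 10_000):
        print(n, round(p_any_failure(n, 0.001), 3))
```

Under those assumptions, the chance of at least one failure is already well over half at a thousand tasks, so recovery has to be part of the design and not an afterthought.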