Hacker News new | ask | show | jobs
by pjbringer 4651 days ago
In that case, your IO problem is easy, because it's small with regard to CPU time. You can get away with putting the whole data in sql databases, and/or making multiple copies of your data. Then you can use as many workers as you want, with usually simple partitioning logic.
2 comments

Originally I did just that, but ultimately decided to move to Hadoop. When combined with Amazon EMR, launching arbitrarily large cluster is just a few clicks. You can then monitor progress, have robust cluser-wide error handling, and your data gets nicely merged into output files in S3 (not so easy with the home-baked solution).
We've had a lot of success with EMR as well - we have an hourly Pig job that produces data for our analytics database. It's not a particularly complex script, but our traffic volume is unpredictable so it's reassuring to know that we can add resources to a slow job and have it finish faster.

The downside of EMR is that it can be fairly expensive once you start needing the beefy machines. We're lucky that we can afford to have our analytics delayed an hour or two and can thus run on Spot instances (except for the Master node). When we move to a streaming architecture I'm not sure EMR will still be competitive, since we won't be able to have those machines go away on us.

Edit: clarity.

If you only look at CPU time, then yes, maybe you could do that. But there are many more factors at play.