by lgieron
4654 days ago
Originally I did just that, but ultimately decided to move to Hadoop. When combined with Amazon EMR, launching an arbitrarily large cluster is just a few clicks. You can then monitor progress, get robust cluster-wide error handling, and have your data nicely merged into output files in S3 (not so easy with the home-baked solution).

The downside of EMR is that it can get fairly expensive once you start needing the beefy machines. We're lucky that we can afford to have our analytics delayed an hour or two, and can thus run on Spot instances (except for the Master node). When we move to a streaming architecture, I'm not sure EMR will still be competitive, since we won't be able to have those machines go away on us.
Edit: clarity.
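
For anyone wanting to try the "Spot for everything except the Master" layout, here is a rough sketch of what the instance group config might look like if you launch the cluster from Python rather than the console. The helper name `build_instance_groups`, the instance type, the node count, and the bid price are all made up for illustration and are not our actual settings. The dict keys follow the `InstanceGroups` structure that boto3's EMR `run_job_flow` accepts, which is an assumption about tooling; the comment above only describes clicking through EMR.

```python
def build_instance_groups(core_count, instance_type, spot_bid_usd):
    """Build an EMR InstanceGroups list: on-demand master, Spot core nodes.

    All argument values are placeholders chosen by the caller.
    """
    return [
        {
            # The master stays on-demand so the cluster survives Spot reclaims.
            "Name": "master",
            "InstanceRole": "MASTER",
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        {
            # Worker nodes run on Spot; they can disappear if the price
            # rises above the bid, which is fine for delayed batch jobs.
            "Name": "core-spot",
            "InstanceRole": "CORE",
            "Market": "SPOT",
            "BidPrice": f"{spot_bid_usd:.3f}",  # EMR expects the bid as a string
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
    ]


# Hypothetical values, for illustration only.
groups = build_instance_groups(core_count=10, instance_type="m1.large", spot_bid_usd=0.05)
```

The resulting list would go under `Instances={"InstanceGroups": groups, ...}` in a `run_job_flow` call, alongside the usual job name, roles, and steps.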