Hacker News
by gandalfu 4117 days ago
I believe one of the key advantages of Spark over Hadoop is being able to run the full stack in a small environment (a single machine) and do all the coding there, without needing a cluster just for development.
3 comments

This is why I like Cascading [1]. It has a higher-level API on top of Hadoop. It also works in local mode with little change. I've actually used it to do transformation work from local files (CSV), join them into structured documents and dump them into ArangoDB. I liked it so much I wrote a third-party library to work with ArangoDB in Hadoop [2].

[1] cascading.org/
[2] https://github.com/deusdat/guacaphant

This is a big sell for me. There are some small catches to it, though. For example, with operations that require an associative function (such as reduceByKey), a non-associative function may not cause any visible problem until the dataset becomes large enough to be split across multiple workers. So on my team, the testers specifically check that a unit test demonstrates the reduce function's associativity.
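To make that catch concrete, here is a rough sketch of what such a check could look like. It is plain Python rather than Spark, and the helper names (partitioned_reduce, is_associative_on) are made up for illustration. The chunking only imitates how partial results from several workers get combined. Spark's documentation also asks for the function to be commutative, which this sketch does not check.

```python
from functools import reduce
from itertools import product


def partitioned_reduce(func, values, num_partitions):
    """Reduce each chunk on its own, then reduce the partial results.

    Hypothetical helper: it only imitates a distributed reduce and
    assumes `values` is non-empty.
    """
    size = -(-len(values) // num_partitions)  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    partials = [reduce(func, chunk) for chunk in chunks]
    return reduce(func, partials)


def is_associative_on(func, samples):
    """Check f(f(a, b), c) == f(a, f(b, c)) for every triple of samples.

    Passing is evidence on the sample set only, not a proof.
    """
    return all(
        func(func(a, b), c) == func(a, func(b, c))
        for a, b, c in product(samples, repeat=3)
    )


def add(a, b):
    return a + b


def subtract(a, b):
    return a - b


values = [1, 2, 3, 4, 5, 6]
samples = [0, 1, 2, -3]

# Addition gives the same answer however the data is split.
add_single = partitioned_reduce(add, values, 1)
add_split = partitioned_reduce(add, values, 3)

# Subtraction looks fine on one "worker" but changes once the data is split.
sub_single = partitioned_reduce(subtract, values, 1)
sub_split = partitioned_reduce(subtract, values, 3)

print(add_single == add_split, sub_single == sub_split)
print(is_associative_on(add, samples), is_associative_on(subtract, samples))
```

The point is that the subtraction case would pass any test that only ever runs on a single partition, which is exactly the situation in local development with small data.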
You can also run MapReduce in local mode on a single machine.