by evancasey
4409 days ago
My experience using Spark has also been nothing but positive. I recently built a similarity-based recommender on Spark (https://github.com/evancasey/sparkler) and found it to be significantly faster than comparable implementations on Hadoop. subprotocol's point about specifying the number of tasks/data partitions is true: you need to set this manually to get good performance, even on a small dataset. Other than that, Spark gives you good results pretty much out of the box. More advanced features such as broadcast variables, caching, and custom serializers will further optimize your application but, contrary to what the author seems to believe, they are not critical when first starting out.
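
For anyone wondering what setting the partition count by hand looks like, here is a minimal PySpark sketch. It is not taken from sparkler; the function names (`suggested_partitions`, `count_known_pairs`), the app name, and the local master are all made up for illustration. The "2 tasks per core" default follows the rule of thumb in the Spark tuning guide (roughly 2-3 tasks per CPU core), so treat it as a starting point rather than a tuned value.

```python
def suggested_partitions(total_cores: int, tasks_per_core: int = 2) -> int:
    """Hypothetical helper: pick a partition count from the core count.

    Uses the tuning-guide rule of thumb of roughly 2-3 tasks per CPU core.
    """
    if total_cores < 1 or tasks_per_core < 1:
        raise ValueError("total_cores and tasks_per_core must be positive")
    return total_cores * tasks_per_core


def count_known_pairs(pairs, known_ids, total_cores):
    """Hypothetical job: count (user, item) pairs whose item is in known_ids.

    Returns (matching_pairs, distinct_users_among_matches).
    """
    # Imported lazily so the helper above works without Spark installed.
    from pyspark import SparkConf, SparkContext

    # Assumption: a local master, just so the sketch runs on one machine.
    conf = (
        SparkConf()
        .setAppName("partition-sketch")
        .setMaster(f"local[{total_cores}]")
    )
    sc = SparkContext(conf=conf)
    try:
        # Set the number of partitions explicitly instead of relying on defaults.
        num_partitions = suggested_partitions(total_cores)
        rdd = sc.parallelize(pairs, numSlices=num_partitions)

        # Optional extras: broadcast the lookup set once per executor,
        # and cache the filtered RDD because it is used twice below.
        known = sc.broadcast(set(known_ids))
        filtered = rdd.filter(lambda p: p[1] in known.value).cache()

        matching = filtered.count()
        users = filtered.map(lambda p: p[0]).distinct().count()
        return matching, users
    finally:
        sc.stop()
```

Calling something like `count_known_pairs(pairs, known_ids, total_cores=4)` would run with 8 partitions. The broadcast and cache lines are the "nice to have" part: the job still works if you drop them and close over a plain Python set instead.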