by gshayban 3824 days ago
Many people blindly point to the docs to say "don't use groupBy, prefer reduce because it's faster..." Are there better examples that illustrate the fundamental differences between the two operations? Surely there is still a need for both.
2 comments

Reduce can perform reductions locally on each machine before shuffling the data. This decreases both memory usage and network overhead. If you need all the elements for a given key, e.g. to display them to a user or save them to a DB, then groupBy is probably the right choice. If you're going to perform some form of reduce after the groupBy, though, it's likely sub-optimal, since every element gets shuffled only to be collapsed afterwards.
Databricks has a page that describes the pitfalls: https://databricks.gitbooks.io/databricks-spark-knowledge-ba...
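
To make the difference concrete, here is a rough sketch in plain Python (not the Spark API) that simulates the two strategies. The two "partitions", the function names, and the record counts are all made up for illustration; the point is only to show how many records would have to cross the shuffle in each case.

```python
from collections import defaultdict

# Hypothetical data: two "machines" (partitions), each holding (key, value) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

def group_then_sum(parts):
    """groupBy-style: every record crosses the shuffle, then values are summed."""
    shuffled = [pair for part in parts for pair in part]  # all records are sent
    groups = defaultdict(list)
    for key, value in shuffled:
        groups[key].append(value)  # the full value list is held per key
    return {k: sum(v) for k, v in groups.items()}, len(shuffled)

def reduce_by_key(parts):
    """reduce-style: combine locally per partition, shuffle only partial sums."""
    shuffled = []
    for part in parts:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value  # local (map-side) combine
        shuffled.extend(local.items())  # one record per key per partition
    totals = defaultdict(int)
    for key, partial in shuffled:
        totals[key] += partial
    return dict(totals), len(shuffled)

grouped_result, grouped_shuffled = group_then_sum(partitions)
reduced_result, reduced_shuffled = reduce_by_key(partitions)
```

Both functions produce the same per-key sums, but the reduce-style version shuffles one record per key per partition rather than one per input element. With many repeated keys, that gap grows with the data size.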

I don't know if the OutOfMemory exception can still occur in recent versions of Spark, but the performance impact of groupByKey is very real.