by evancasey 4053 days ago
Huge Spark fan here. Love the execution model, the API, the supporting libs, etc.

Unfortunately, Spark doesn't scale well on large datasets (10TB+). Sure, it's possible (and has been done), but right now there are too many rough edges for it to be a better choice than Scalding/Cascading for data processing at scale. Most of this boils down to fine-tuning certain Spark parameters, which is a pain when you're dealing with long-running, resource-intensive workflows.
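
To give a rough idea of the kind of tuning I mean, here is a minimal sketch of a handful of settings rendered as spark-submit flags. The property names are real Spark configuration keys, but the values are made up for illustration and the `to_submit_args` helper is hypothetical. The right numbers depend entirely on the cluster and the job, which is the painful part.

```python
# Illustrative values only (assumptions, not recommendations):
# the right numbers depend on the cluster and the workload.
spark_conf = {
    "spark.executor.memory": "8g",        # heap size per executor
    "spark.executor.cores": "4",          # cores per executor
    "spark.default.parallelism": "2000",  # default partition count for RDD shuffles
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
}


def to_submit_args(conf):
    """Hypothetical helper: render a settings dict as spark-submit --conf flags."""
    args = []
    for key in sorted(conf):  # sorted so the output is stable between runs
        args += ["--conf", f"{key}={conf[key]}"]
    return args


submit_args = to_submit_args(spark_conf)
print(" ".join(submit_args))
```

Each change like this typically means rerunning the job to see the effect, which is slow when a single run takes hours.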

2 comments

Do you have any references to support that 10TB claim?
What is the upper limit you have hit with Spark?