HN Mirror



	by evancasey 4053 days ago
	Huge Spark fan here. Love the execution model, API, supporting libs, etc. Unfortunately, Spark doesn't scale well on large datasets (10TB+). Sure, it's possible (and has been done), but right now there are too many rough edges to make it a better choice than Scalding/Cascading for data processing at scale. Most of this boils down to fine-tuning certain Spark parameters, which is a pain when you're dealing with long-running, resource-intensive workflows.

2 comments

Do you have any references to support that 10TB claim?

What is the upper limit you have hit with Spark?