Hacker News
by ganeshkrishnan 2668 days ago
Spark is intended to work on big datasets. Its machine learning capability is very limited, and its primary strength is processing huge amounts of data. I think it's unfair to blame it for 'failing' on small datasets.
1 comments

I agree that Spark may be a fine choice for ETL and generic pipeline tasks.

But lots of companies will choose it as a data warehouse computation layer and then enforce a policy to standardize everything around it, including tasks like machine learning that are poorly suited for Spark.

Worse, companies like Databricks will encourage this standardization and act like yes-man consultants, promising that Spark ML offerings can solve all the problems. You quickly end up with some brittle monster of a data warehouse system that is oriented to be convenient for Spark (which can’t effectively be used to solve the problems), while everything is deeply inconvenient to pipe to non-Spark systems. And nobody is sympathetic to any budgetary needs for other systems, since the budget was all spent on Spark.