I agree that Spark may be a fine choice for ETL and generic pipeline tasks.

But lots of companies will choose it as a data warehouse computation layer and then enforce a policy to standardize everything around it, including tasks like machine learning that are poorly suited for Spark.

Worse, companies like Databricks will encourage this standardization and act like yes-man consultants, promising that Spark ML offerings can solve all the problems. You quickly end up with some brittle monster of a data warehouse system that is oriented to be convenient for Spark (which can’t effectively be used to solve the problems), where everything is deeply inconvenient to pipe to non-Spark systems. And nobody is sympathetic to any budgetary needs for other systems, since they spent it all on Spark.