Spark is intended to work on big datasets. Its machine learning capability is very limited, and its primary strength is processing huge amounts of data. I think it's unfair to blame it for 'failing' on small datasets.
I agree that Spark may be a fine choice for ETL and generic pipeline tasks.
But lots of companies will choose it as a data warehouse computation layer and then enforce a policy to standardize everything around it, including tasks like machine learning that are poorly suited for Spark.
Worse, companies like Databricks will encourage this standardization and act like yes-man consultants, promising that Spark ML offerings can solve all the problems. You quickly end up with some brittle monster of a data warehouse system that is oriented to be convenient for Spark (which can’t effectively be used to solve the problems), everything is deeply inconvenient to pipe to non-Spark systems, and nobody is sympathetic to any budgetary needs for other systems, since they spent it all on Spark.