
I've been in the industry for 10+ years. I've worked with everything from telco metrics firehoses to bank customer event streams to deep learning.

The Venn intersection of conditions where Spark makes sense is really rather narrow. A single high-spec instance running leaner tooling will generally meet one's requirements while blowing Spark out of the water in terms of perf and cost.

Operationally, Spark is a huge PITA, hence Databricks and a host of other offerings (I guess including this one) that try to manage the pain. Meanwhile, something like dask-kubernetes will cater to the same use case with significantly lower operational complexity and, again, much better perf and cost efficiency.

I can't really think of a scenario where I'd choose to use Spark on a greenfield project today.