|
|
|
|
|
by makmanalp
2901 days ago
|
|
The deployment is /way/ more simple, especially if you already have a python environment (even more so with conda), which is one of the main attractions for me. The other is that the API is way richer - spark dataframes tie your hands down in so many ways and require you to write tons of custom code for routine stuff, while pandas (and dask) has builtins for almost everything imaginable these days. The tradeoff is that spark and hadoop in general have invested /serious/ efforts into resilience, while dask only really provides protection against worker failure, but not really scheduler failure. In practice is this really an issue? shrug it depends. "Works for me". How many tasks do you run in parallel? If you're doing ad-hoc analysis on a very large dataset, dask might be a great fit. If you have a data warehouse use case and have tons of people running analytics queries, then you have uptime requirements, and spark might be a better fit. That's where I ended up. |
|