by wenc 2942 days ago
> Once you’ve sized out of pandas/scipy/scikit, your next major option is Spark, which is certainly powerful, but is also unwieldy.

There's also Dask [1], a native Python framework for distributed computation (from Anaconda). Irina Truong gave an excellent talk about it at PyCon 2018 [2]. I had never thought to look into Dask because Spark worked well for my use cases, but if you're using Python it has a lot of advantages over Spark: it's faster and more lightweight than PySpark, and it has no JVM serialization overhead. Dask also runs on Kubernetes clusters, so scaling out is not an issue.

And yeah, a huge amount of important data analysis work will continue to be done on data that fits in memory. Data analysis on distributed datasets is important, but from what I can tell, outside of certain domains it's not the majority of the data analysis work out there.

[1] http://dask.pydata.org/en/latest/spark.html

[2] https://www.youtube.com/watch?v=X4YHGKj3V5M