by wenc
2942 days ago
> Once you’ve sized out of pandas/scipy/scikit, your next major option is Spark, which is certainly powerful, but is also unwieldy.

There's also Dask [1], a native Python framework for distributed computations (by Anaconda). Irina Truong gave an excellent talk at PyCon 2018 about it [2].

I had never thought to look into Dask because Spark worked well for my use cases, but it has a lot of advantages over Spark (e.g. speed -- it's faster and more lightweight than PySpark and has no JVM serialization overhead) if you're using Python. Dask also runs on Kubernetes clusters, so scaling is not an issue.

And yeah, a huge amount of important data analysis work will continue to be done on data that fits in memory. Data analysis on distributed datasets is important, but from what I can tell, outside of certain domains it's certainly not the majority of the data analysis work out there.

[1] http://dask.pydata.org/en/latest/spark.html

[2] https://www.youtube.com/watch?v=X4YHGKj3V5M