|
We use dask heavily, along with the rest of the PyData ecosystem. I guess we are in the 'sweet spot' where the data doesn't fit in memory to begin with, but once we have filtered and aggregated it, we switch over to pandas. That's exactly what dask recommends too. Our datasets don't exceed 100GB right now. Also note that dask clearly acknowledges the challenges of dealing with data in the terabyte range: https://coiled.io/blog/dask-as-a-spark-replacement/ Most of our use cases right now involve using multiple cores of one big instance rather than resorting to cluster computing. With Spark, there is an additional, steep learning curve, plus the complexities of dealing with cluster computing. And Spark ML is not well known. With dask/pandas it's easy enough to feed scikit-learn and/or bring in dask-ml, which is just a pip install, and you can scale well-known sklearn modules effectively. I think in the end, it's about keeping things simple. As others said, if you are already heavily invested in Spark/Scala/Hadoop, that may make sense for you. For non-CS folks, it will be a challenge. As for vaex, it's very interesting. One issue is that it seems to want HDF5 and doesn't want to work with Parquet. And its API is not fully compatible with pandas. Ray/Modin: I played with it a bit; it may be a bit too new for enterprise use and may be geared more toward ML workloads. That's my take anyway, and it may have progressed substantially already. |