by deshpand 1668 days ago
We use dask heavily, along with the rest of the pydata ecosystem. I guess we are in the 'sweet spot' where the data doesn't fit in memory to begin with, but once we have done our filtering and aggregations, we switch over to pandas. That's exactly what dask recommends too. Our datasets don't exceed 100GB right now.

Also note that dask openly acknowledges the challenges of dealing with data in the terabyte range: https://coiled.io/blog/dask-as-a-spark-replacement/

Most of our use cases right now involve using multiple cores of one big instance rather than resorting to cluster computing.

With Spark, there is an additional, steep learning curve, plus the complexities of dealing with cluster computing. And Spark-ML is not well known. With dask/pandas it's easy enough to feed scikit-learn and/or bring in dask-ml, which is just a pip install, and you can scale well-known sklearn modules effectively.

I think in the end, it's about keeping things simple. As others said, if you are already invested big in Spark/Scala/Hadoop, that may make sense for you. For non-CS folks, this will be a challenge.

As for vaex, it's very interesting. One issue is that it seems to want hdf5 and doesn't want to work with parquet. And its API is not fully compatible with pandas.

Ray/Modin: I played with it a bit; it may be a bit too new for enterprise use and may be more geared toward ML workloads. That's my take anyway, and it may have progressed substantially already.


If Dask doesn't consider their distributed version even on a single node (which is what we were using) to be production ready then they should label it as such.
Do you have any citation on why "Dask doesn't consider their distributed version" to be ready? If it is your own view, then that's ok.

I think dask is in heavy usage in real production systems. Let me cite one such usage here, from Capital One (no affiliation, just referencing a big bank for 'production ready' purposes) https://www.capitalone.com/tech/machine-learning/dask-and-ra... (also not necessarily suggesting any rapids/GPU usage, you can decouple it from the article)

And note the article is from Nov 2019. Two years is a substantial amount of time for further improvements.

Your post seemed to argue that Dask is fine if you stick to relatively small data that fits on a single node and then switch to pandas. You also noted that this is what Dask recommends. The implication being that I ran into issues because I didn't use Dask "the right way."

I don't see how you can argue both that and that dask distributed is production ready at the same time.

I've been in big data for 15 years and was probably one of the first few thousand production hadoop users. If you think "a big company used a big data tech so it's production ready" is an argument then I've got a few bridges to sell you. A lot of companies use a lot of technologies that they spend a lot of time beating into a shape where for their specific use cases they work just well enough to not get them all fired.

In the end, it's not about an OPEN SOURCE tool being perfect but whether it is helping you solve a problem. If it did not help you and YOU don't consider it production ready, then that's fine. But you seem to argue that Dask should put this disclaimer out there. That would imply that many other open source tools, including Spark, would have to do the same.

Dask has solved specific problems for us and we are grateful for it. I remain open-minded about other choices and listed them with the understanding I have of them.

Switching to pandas when you can is going with the philosophy of keeping things simple. I like the flexibility of going back and forth between these as and when I choose.