by Fiahil 1112 days ago
Thank you for your responses!

I'm going to be very blunt here, because you need to hear this to move forward:

You HAVE TO be at least API-compatible with Polars or Pandas to exist. Being backend-compatible with Arrow is not enough.

There is no technical reason not to pick one and go with it, apart from it being a very difficult task.

As of today, I have two major pains: Pandas being a giant memory hog, and Polars not being a drop-in replacement for Pandas. I am pushing Polars as hard as I can in every project I can touch, and it's still a very long way from being the default DataFrame library. Data scientists will continue to use Pandas for the foreseeable future, and that saddens me greatly because I will also have to work with OOMKilled pipelines for the foreseeable future.

There is no place for a third alternative. Either you become a "distribution bridge for Polars", and that would be absolutely amazing, or you go your own way, and I'll put a small star on GitHub, leave a "Noice!" in the comment section, move on, and never come back.

It's tough, but sadly real.

1 comment

Hi! (One of the Daft maintainers here.) Thanks for the feedback. Ultimately you're right that supporting the full Polars syntax in a distributed fashion is very difficult. There are libraries out there that do "Pandas but distributed", but from what I have seen, they prioritized API coverage over performance or memory consumption. So you end up in a similar boat to the situation you mentioned.

We're trying to start with a simpler API that maps well to a distributed query that we can execute well, and then add the features that people request.

I would love to know what you would want to see in Daft!

Then maybe the right choice isn't to start a fresh DataFrame library from Arrow, but rather to leverage Polars and build out the distributed part (in Rust, of course, not in Python).

> We're trying to start with a simpler API that maps well to a distributed query that we can execute well, and then add the features that people request.

That would have been a good approach in a field that had not been standardised around a single library since its infancy. Polars is beating Pandas in every possible benchmark, yet will continue to struggle for adoption "until the end". Do you really think Daft can do better? (If yes, go ahead, and prove me wrong!)

As a comparison, it's like trying to introduce a new transport-layer protocol such as QUIC (https://en.wikipedia.org/wiki/QUIC) against TCP. You can do that if and only if there are obvious benefits, no drawbacks, and you are prepared to wait 15 years for 30% market share.