Hacker News new | ask | show | jobs
by jaychia 1103 days ago
> So Daft is a distributed Polars ?

We did actually start by using Polars as our underlying execution engine, but eventually transitioned off to our own Rust Table abstraction to better suit our needs (e.g. custom datatypes and kernels). We still share the arrow2 dependency with Polars for in-memory representation of our data.

> what's the "killer feature" of Daft for trying to compete with Polars (and Pandas)

We don't specifically try to compete on a local machine since there is so much good new tooling being made recently available (DuckDB, Polars etc). That being said, we do try our best to make the local experience as seamless as possible because we've all felt the pain of developing locally in PySpark. Aside from the ability to go distributed, I like using Daft for:

* Working on more "complex" datatypes such as URLs, images etc

* Working with "many [Parquet/CSV/JSON] files in cloud storage (S3)" which we've found to be quite common for many workloads. We already have some and are building more intelligent optimizations here such as column pruning, predicate pushdowns etc to reduce and optimize I/O from the cloud.

As you've pointed out, one of our main responsibilities is to handle memory very very well. This is something we're actively working on and I'm thinking this will be a big reason to use us locally as well!

> Are they API-compatible?

We are not API-compatible with Pandas/Polars, but our API is quite inspired by Polars. We found that building out the core set of dataframe functionality was much more tractable than attempting to go API-compatible from the get-go.

> How's the memory consumption benchmark ? (TBH, this is the only interesting metric. Timing and Latency are not really important when your most important competitor is Spark.

We think throughput is still important when comparing against Spark, since this can save a lot of money when running some potentially very expensive queries!

That being said, you're spot-on about memory usage being a key metric here. One of the key advantages of having native types for multimodal data (e.g. tensors and images) is that we can much more tightly account for memory requirements when working with these types, beyond the usual black-box Python UDF which often results in a ton of out-of-memory issues.

Our current mechanism for dealing with this is relying on Ray's excellent object spilling mechanisms for working with out-of-core data. We recognize that there are many situations in which this is insufficient.

The team is working on many advanced features here (e.g. microbatching) that will give Daft a big boost, and will release benchmarks as soon as we have them!

[Edit: typos!]

1 comments

Thanks you for your responses !

I'm going to be very blunt here, because you need to hear this to go forward :

You HAVE TO be at least API-compatible with Polars or Pandas to exist. Being backend-compatible with arrow is not enough.

There is no technical reason why you would not pick one and go with it, apart from being a very difficult task.

As of today, I have 2 major pains : Pandas being a giant memory hog and Polars not being a drop-in replacement for Pandas. I am pushing Polars, as hard as I can in all projects I can touch, and it's a very long way from being the default DataFrame library. Data Scientists will continue to use Pandas for the foreseeable future, and that saddens me greatly because I will also have to work with OOMKilled pipelines for the foreseeable future.

There is no place for a 3rd alternative, so either you become a "distribution bridge for Polars", and that would be absolutely amazing. Or, you go your own way, I'll put a small star on github, a "Noice!" in the comment section and move on and never come back.

It's tough, but sadly real.

Hi! (one of the Daft maintainers here), thanks for the feedback. Ultimately you're right that supporting the full Polars syntax in a distributed fashion is very difficult. There are libraries out there that do "Pandas but distributed" but from what I have seen is that they prioritized API coverage rather than performance or memory consumption. So you end up in a similar boat to the situation you mentioned.

We're trying to start with a simpler API that maps well to a distributed query query that we can execute well and then add the features that people request for.

I would love to know what you would want to see in Daft!

Then, maybe the right choice isn't to start a fresh DataFrame library from arrow, but rather leverage Polars and build out the distributed part (in Rust, of course, not in Python).

> We're trying to start with a simpler API that maps well to a distributed query query that we can execute well and then add the features that people request for.

That would have been a good approach on a field that has not been standardised around a single library since its infancy. Polars is beating Pandas in every possible benchmark, yet will continue to struggle for adoption "until the end". Do you really think Daft can do better ? (If yes, go ahaid, and prove me wrong !)

As a comparison, it's like trying to introduce a new transport layer protocol (https://en.wikipedia.org/wiki/QUIC) against TCP. You can do that if and only if there are obvious benefits, no drawbacks and you are prepared to wait 15 years for 30% market share.