| > So Daft is a distributed Polars ? We did actually start by using Polars as our underlying execution engine, but eventually transitioned off to our own Rust Table abstraction to better suit our needs (e.g. custom datatypes and kernels). We still share the arrow2 dependency with Polars for in-memory representation of our data. > what's the "killer feature" of Daft for trying to compete with Polars (and Pandas) We don't specifically try to compete on a local machine since there is so much good new tooling being made recently available (DuckDB, Polars etc). That being said, we do try our best to make the local experience as seamless as possible because we've all felt the pain of developing locally in PySpark. Aside from the ability to go distributed, I like using Daft for: * Working on more "complex" datatypes such as URLs, images etc * Working with "many [Parquet/CSV/JSON] files in cloud storage (S3)" which we've found to be quite common for many workloads. We already have some and are building more intelligent optimizations here such as column pruning, predicate pushdowns etc to reduce and optimize I/O from the cloud. As you've pointed out, one of our main responsibilities is to handle memory very very well. This is something we're actively working on and I'm thinking this will be a big reason to use us locally as well! > Are they API-compatible? We are not API-compatible with Pandas/Polars, but our API is quite inspired by Polars. We found that building out the core set of dataframe functionality was much more tractable than attempting to go API-compatible from the get-go. > How's the memory consumption benchmark ? (TBH, this is the only interesting metric. Timing and Latency are not really important when your most important competitor is Spark. We think throughput is still important when comparing against Spark, since this can save a lot of money when running some potentially very expensive queries! That being said, you're spot-on about memory usage being a key metric here. One of the key advantages of having native types for multimodal data (e.g. tensors and images) is that we can much more tightly account for memory requirements when working with these types, beyond the usual black-box Python UDF which often results in a ton of out-of-memory issues. Our current mechanism for dealing with this is relying on Ray's excellent object spilling mechanisms for working with out-of-core data. We recognize that there are many situations in which this is insufficient. The team is working on many advanced features here (e.g. microbatching) that will give Daft a big boost, and will release benchmarks as soon as we have them! [Edit: typos!] |
I'm going to be very blunt here, because you need to hear this to go forward :
You HAVE TO be at least API-compatible with Polars or Pandas to exist. Being backend-compatible with arrow is not enough.
There is no technical reason why you would not pick one and go with it, apart from being a very difficult task.
As of today, I have 2 major pains : Pandas being a giant memory hog and Polars not being a drop-in replacement for Pandas. I am pushing Polars, as hard as I can in all projects I can touch, and it's a very long way from being the default DataFrame library. Data Scientists will continue to use Pandas for the foreseeable future, and that saddens me greatly because I will also have to work with OOMKilled pipelines for the foreseeable future.
There is no place for a 3rd alternative, so either you become a "distribution bridge for Polars", and that would be absolutely amazing. Or, you go your own way, I'll put a small star on github, a "Noice!" in the comment section and move on and never come back.
It's tough, but sadly real.