by isoprophlex 1697 days ago
The relevant results from the linked article:

    ##    format          median_time  mem_alloc
    ## 1  R (RDS)               1.34m     4.08GB
    ## 2  SQL (SQLite)          5.48s     6.17MB
    ## 3  SQL (DuckDB)          1.76s   104.66KB
    ## 4  Arrow (Parquet)       1.36s   453.89MB
I'd bet that doing the same with Pandas would require time and space similar to RDS (1). I really hope DuckDB makes it in the Python world; everything I read about it seems very promising. Using it myself for toy projects was pleasant, too.
6 comments

Have you used the Polars (https://www.pola.rs/) package? It does what Pandas does with a fraction of the RAM and at twice the speed.
Thanks for that! Polars looks really interesting.
Note that the memory allocation figure comes from Rprofmem (via bench::mark), so it only measures allocations for objects on R's heap. Allocations made by C extensions (like DuckDB and SQLite) aren't tracked. They're surely _more_ space-efficient than reading everything into RAM, but perhaps by a smaller margin than shown here.
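To illustrate the same effect in Python rather than R (an analogy, not the article's benchmark): `tracemalloc` is loosely comparable to Rprofmem here, in that it only sees allocations made through Python's own allocator. In this minimal sketch the table `t`, its columns, and the helper names are made up for illustration; memory that SQLite allocates in C, including the in-memory database itself, should not show up in the traced numbers.

```python
import sqlite3
import tracemalloc


def peak_traced_bytes(fn):
    """Run fn() and return the peak memory tracemalloc saw, in bytes."""
    tracemalloc.start()
    try:
        fn()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def make_db(n_rows=200_000):
    """Build a throwaway in-memory table (hypothetical schema)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (grp INTEGER, val REAL)")
    con.executemany(
        "INSERT INTO t VALUES (?, ?)",
        ((i % 10, float(i)) for i in range(n_rows)),
    )
    con.commit()
    return con


con = make_db()

# Aggregation runs inside SQLite's C code; only 10 result rows reach Python.
peak_sql = peak_traced_bytes(
    lambda: con.execute("SELECT grp, AVG(val) FROM t GROUP BY grp").fetchall()
)

# Every row is materialised as Python objects, so the tracer sees all of it.
peak_py = peak_traced_bytes(
    lambda: con.execute("SELECT grp, val FROM t").fetchall()
)

print(f"aggregate in SQLite: {peak_sql:,} bytes traced")
print(f"load all rows:       {peak_py:,} bytes traced")
```

The first figure should come out far smaller than the second even though SQLite scans the same rows in both cases, which is why a heap-only measurement can flatter the database engines.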
This benchmark is more comprehensive for this type of analytical work:

https://h2oai.github.io/db-benchmark/

Here’s hoping that DuckDB will add support for spatial data, indexing, and query predicates. It would be great if this were a first-class feature instead of bolted on like SpatiaLite is to SQLite.
It could! But those are not exactly "easy things to do" ;) I'm sure it could happen, for a sufficiently large grant.
Are you a contributor? Is there a way to bring this up with the team? Perhaps a place to start would be to add support for storing OGC Simple Features and corresponding to/from conversion functions. I don’t have a lot of spare time, but may want to take a stab at a proof of concept if some developers could help orient me to the code.
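For a sense of what the "to/from conversion functions" would involve at the smallest scale, here is a rough Python sketch of round-tripping a 2D point through OGC well-known binary (WKB). The function names are hypothetical and this says nothing about how DuckDB would implement it; it only covers the Point type (geometry code 1), encoded as a byte-order flag, a 32-bit type code, and two doubles.

```python
import struct

WKB_POINT = 1  # geometry type code for Point in OGC WKB


def point_to_wkb(x, y):
    """Encode a 2D point as little-endian WKB (21 bytes)."""
    # 1 byte order flag (1 = little-endian), uint32 type, two float64 coords
    return struct.pack("<BIdd", 1, WKB_POINT, x, y)


def point_from_wkb(buf):
    """Decode a 2D WKB point, honouring the byte-order flag."""
    order = "<" if buf[0] == 1 else ">"
    geom_type, x, y = struct.unpack(order + "Idd", buf[1:21])
    if geom_type != WKB_POINT:
        raise ValueError(f"not a WKB Point (type {geom_type})")
    return x, y


wkb = point_to_wkb(4.9, 52.37)
print(len(wkb), point_from_wkb(wkb))
```

A real implementation would also need the other geometry types, spatial reference handling, and indexing, which is where the hard work lies.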
DuckDB developer here - we absolutely welcome outside contributions. Feel free to open an issue or discussion on our github for a feature request, and we would be happy to point you in the right direction!
Maybe I should get involved and try to help with this… however, DB hacking is a bit out of my depth at the moment…
I thought pandas was in some sense evolving towards Arrow? (For those who aren’t aware, they share a co-creator.)

Edit: that said, a benchmark would be worthwhile, and similarly I hope the tidyverse evolves towards Arrow speed, since they also share a co-creator.