by isoprophlex 1697 days ago

The relevant results from the linked article:

    ##    format          median_time  mem_alloc
    ## 1  R (RDS)               1.34m     4.08GB
    ## 2  SQL (SQLite)          5.48s     6.17MB
    ## 3  SQL (DuckDB)          1.76s   104.66KB
    ## 4  Arrow (Parquet)       1.36s   453.89MB

I'd bet that doing the same with Pandas would require time and space similar to RDS (1). I really hope DuckDB makes it in the Python world, everything I read about it seems very promising. Using it myself for toy projects was pleasant, too.
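
To illustrate why the in-process SQL engines allocate so little on the host-language side, here is a minimal Python sketch, not the article's R benchmark. It assumes a made-up `trips` table in an in-memory SQLite database and compares loading every row before aggregating against letting the engine aggregate. `tracemalloc` sees only Python-side allocations, so the numbers say nothing about memory used inside SQLite itself.

```python
import sqlite3
import tracemalloc

# Hypothetical data set: 100,000 rows in a made-up "trips" table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trips (vendor INTEGER, fare REAL)")
con.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    ((i % 4, float(i % 50)) for i in range(100_000)),
)


def peak_python_bytes(fn):
    """Run fn and return (result, peak bytes allocated by Python objects)."""
    tracemalloc.start()
    try:
        result = fn()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def eager():
    # Pull every row into Python first, then aggregate (the "read it all" style).
    rows = con.execute("SELECT vendor, fare FROM trips").fetchall()
    totals = {}
    for vendor, fare in rows:
        totals[vendor] = totals.get(vendor, 0.0) + fare
    return totals


def pushed_down():
    # Let the engine aggregate; only one row per vendor reaches Python.
    query = "SELECT vendor, SUM(fare) FROM trips GROUP BY vendor"
    return dict(con.execute(query).fetchall())


eager_totals, eager_peak = peak_python_bytes(eager)
sql_totals, sql_peak = peak_python_bytes(pushed_down)
print(f"eager: {eager_peak:,} B   pushed down: {sql_peak:,} B")
```

Both functions return the same totals; only the amount of data that crosses into the host language differs, which is roughly the gap the `mem_alloc` column shows between rows 1 and 2-3.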

6 comments

ryndbfsrw 1697 days ago

Have you used the Polars (https://www.pola.rs/) package? It does what Pandas does with a fraction of the RAM and at twice the speed.

thinker5555 1697 days ago

Thanks for that! Polars looks really interesting.

kristjansson 1696 days ago

Of course, since the memory allocation comes from Rprofmem (via bench::mark), this only measures allocations of memory for objects on R's heap. Allocations by C extensions (like DuckDB and SQLite) aren't tracked. They're surely _more_ space efficient than just reading everything into RAM, but perhaps by a smaller margin than shown here.
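
The same kind of blind spot can be reproduced outside R. Below is a rough Python analogue, offered as a sketch and not as a description of how `Rprofmem` works internally: `tracemalloc` follows only allocations made through Python's allocator, so data stored inside SQLite's C library barely registers. The `blobs` table and the row count are made up for illustration.

```python
import sqlite3
import tracemalloc

tracemalloc.start()

# Fill an in-memory SQLite database; the rows live in memory owned by the
# SQLite C library, not in Python objects.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE blobs (payload TEXT)")  # hypothetical table
con.executemany(
    "INSERT INTO blobs VALUES (?)",
    (("x" * 200,) for _ in range(50_000)),
)

_, traced_peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Ask SQLite how large the database actually is.
page_count = con.execute("PRAGMA page_count").fetchone()[0]
page_size = con.execute("PRAGMA page_size").fetchone()[0]
db_bytes = page_count * page_size

print(f"Python-side peak seen by tracemalloc: {traced_peak:,} B")
print(f"Database size reported by SQLite:     {db_bytes:,} B")
```

The database holds at least 10 MB of payload, while the traced peak should come out far smaller, so a profiler of this kind understates what the engine really uses.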

hantusk 1697 days ago

This benchmark is more comprehensive for this type of analytical work:

https://h2oai.github.io/db-benchmark/

avidphantasm 1697 days ago

Here’s hoping that DuckDB will add support for spatial data, indexing, and query predicates. It would be great if this were a first-class feature instead of bolted on like SpatiaLite is to SQLite.

loxias 1697 days ago

It could! But those are not exactly "easy things to do" ;) I'm sure it could happen, for a sufficiently large grant.

avidphantasm 1697 days ago

Are you a contributor? Is there a way to bring this up with the team? Perhaps a place to start would be to add support for storing OGC Simple Features and corresponding to/from conversion functions. I don’t have a lot of spare time, but may want to take a stab at a proof of concept if some developers could help orient me to the code.

mytherin 1696 days ago

DuckDB developer here - we absolutely welcome outside contributions. Feel free to open an issue or discussion on our github for a feature request, and we would be happy to point you in the right direction!

avidphantasm 1697 days ago

Maybe I should get involved and try to help with this… however, DB hacking is a bit out of my depth at the moment…

andyferris 1697 days ago

I thought pandas was in some sense evolving towards Arrow? (For those who aren’t aware, they share a co-creator.)

Edit: that said, a benchmark would be worthwhile, and I hope the tidyverse similarly evolves towards Arrow speed, since they also share a co-creator.

petespeed 1697 days ago

https://duckdb.org/docs/api/python
