by mbauman 1800 days ago
> The fastest data.frame library in ALL interpreted languages is consistently data.table, which is R.

DataFrames.jl is very rapidly catching up and starting to surpass it. After hitting a stable v1.0 they've begun focusing on performance and those benchmarks have changed significantly over the past three months. Here's the live view: https://h2oai.github.io/db-benchmark/

2 comments

40% slower in groupbys and 4x slower in joins isn’t convincing.
Oh, I agree. What's convincing to me is the momentum. The DataFrames.jl team only started focusing on performance three months ago, after hitting v1.0[1], and was able to rapidly become competitive on groupbys; join performance is next[2]. Compare the live view with the state when grandparent's blog post was written/updated (March of this year).

I expect it to continue to improve; note that it's starting to be the fastest implementation on some of the groupby benchmarks.

1. https://discourse.julialang.org/t/release-announcements-for-...

2. https://discourse.julialang.org/t/the-state-of-dataframes-jl...

This seems quite cherry-picked, as there are 3 different dataset sizes.

However, yes, it does not beat every other package tested on performance.

Not really cherry-picked. data.table is designed for large data sets with many groups + complex joins.
I would like to see this benchmark run on much more modern hardware, especially for the GPU-related tools, as the 1080 Ti they used is 4 years old.