by tylermw 1798 days ago
"Massively better performance" is a bit misleading: Julia is only massively better at certain workflows. The fastest data.frame library in ALL interpreted languages is consistently data.table, which is R. For in-memory data analysis, Julia will have to offer more than performance to win over statisticians/researchers.

Benchmarks: https://www.ritchievink.com/blog/2021/02/28/i-wrote-one-of-t...


As another commenter pointed out, DataFrames.jl is already faster than data.table in some benchmarks.

And that's the killer feature of Julia. It is easier to micro-optimize Julia code than code in any other language, static or dynamic, meaning that if Julia is not best-in-class at a certain algorithm, it soon will be.

In addition to the comment about df.jl catching up, they aren't comparable at all.

Julia's DF library is generic and allows user defined ops and types. You can put in GPU vectors, distributed vectors, custom number types etc. Julia optimizes all this stuff.

data.frame is just a giant chunk of c (c++) code that one must interact with in very specific ways

> Julia's DF library is generic and allows user defined ops and types. You can put in GPU vectors, distributed vectors, custom number types etc. Julia optimizes all this stuff.

These features aren't of interest to practicing statisticians, which the parent comment was talking about.

> data.frame is just a giant chunk of c (c++) code that one must interact with in very specific ways

I don't understand this criticism: yes, data.table has an API.

>These features aren't of interest to practicing statisticians, which the parent comment was talking about.

It's pretty convenient for things like uncertainty propagation and data cleaning...all things statisticians should care about.
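
To make the uncertainty-propagation point concrete, here is a minimal sketch in plain Python (not Julia, and not the API of DataFrames.jl or any real package). The `Measurement` class, the `_lift` helper, and `column_mean` are hypothetical names invented for illustration, and the error formulas assume independent errors with first-order propagation. The idea it shows is that generic column code, here just the built-in `sum`, works unchanged when the elements are a user-defined number type:

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """Hypothetical number type: a value with a one-sigma uncertainty."""
    value: float
    sigma: float

    def __add__(self, other):
        other = _lift(other)
        # Independent errors add in quadrature.
        return Measurement(self.value + other.value,
                           math.hypot(self.sigma, other.sigma))

    __radd__ = __add__  # lets the built-in sum() start from the int 0

    def __mul__(self, other):
        other = _lift(other)
        # First-order propagation for a product of independent quantities.
        sigma = math.hypot(other.value * self.sigma,
                           self.value * other.sigma)
        return Measurement(self.value * other.value, sigma)

    __rmul__ = __mul__


def _lift(x):
    """Treat plain numbers as exact measurements (sigma = 0)."""
    return x if isinstance(x, Measurement) else Measurement(float(x), 0.0)


def column_mean(column):
    """Generic mean: only needs + and * on the element type."""
    return sum(column) * (1.0 / len(column))
```

Calling `column_mean` on a list of floats or on a list of `Measurement` objects uses the same code path; in the second case the uncertainty is carried along automatically.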

>I don't understand this criticism: yes, data.table has an API

A relatively limited API, walled off from the rest of the language.

Many practicing statisticians do actually care about easily using GPUs and doing distributed computations on distributed data sets with the same code they use for a local data set, which is what those Julia capabilities give you.

> The fastest data.frame library in ALL interpreted languages is consistently data.table, which is R.

DataFrames.jl is very rapidly catching up and starting to surpass it. After hitting a stable v1.0 they've begun focusing on performance and those benchmarks have changed significantly over the past three months. Here's the live view: https://h2oai.github.io/db-benchmark/

40% slower in groupbys and 4x slower in joins isn’t convincing.

Oh I agree. What's convincing to me is the momentum. The DataFrames.jl team only started focusing on performance three months ago after hitting v1.0[1] and were able to rapidly become competitive with groupbys; the performance of join is next[2]. Compare the live view with the state when grandparent's blog post was written/updated (March of this year).

I expect it to continue to improve; note that it's starting to be the fastest implementation on some of the groupby benchmarks.

1. https://discourse.julialang.org/t/release-announcements-for-...

2. https://discourse.julialang.org/t/the-state-of-dataframes-jl...
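
For readers unfamiliar with the two operation categories being compared above (groupby and join), here is a rough sketch of what each one computes, written in plain Python over lists of dicts. This is only an illustration of the operations themselves; it is not how data.table or DataFrames.jl implement them, and the function names `groupby_sum` and `inner_join` are made up for this example.

```python
from collections import defaultdict


def groupby_sum(rows, key, value):
    """Sum the `value` field of each row, grouped by the `key` field."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


def inner_join(left, right, key):
    """Hash join: index the right table by key, then probe it per left row."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    # Emit one merged row for every matching (left, right) pair.
    return [{**l, **r} for l in left for r in index.get(l[key], [])]
```

Both functions do a single pass over their inputs plus hash lookups, which is why the number of distinct groups and the shape of the join keys matter so much for real implementations.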

This seems quite cherry-picked, as there are 3 different dataset sizes.

However, yes, it does not outperform every other package tested.

Not really cherry-picked. data.table is designed for large data sets with many groups and complex joins.

I would like to see this benchmark rerun on much more modern hardware, especially for GPU-related tools, as the 1080 Ti they used is 4 years old.