"Massively better performance" is a bit misleading: Julia is only massively better at certain workflows. The fastest data.frame library in ALL interpreted languages is consistently data.table, which is R. For in-memory data analysis, Julia will have to offer more than performance to win over statisticians/researchers.
As another commenter pointed out, DataFrames.jl is already faster than data.table in some benchmarks.
And that's the killer feature of Julia. It is easier to micro-optimize Julia code than code in any other language, static or dynamic. Meaning if Julia is not best-in-class in a certain algorithm, it soon will be.
In addition to the comment about df.jl catching up, they aren't comparable at all.
Julia's DF library is generic and allows user defined ops and types. You can put in GPU vectors, distributed vectors, custom number types etc. Julia optimizes all this stuff.
data.frame is just a giant chunk of c (c++) code that one must interact with in very specific ways
> Julia's DF library is generic and allows user defined ops and types. You can put in GPU vectors, distributed vectors, custom number types etc. Julia optimizes all this stuff.
These features aren't of interest to practicing statisticians, which the parent comment was talking about.
> data.frame is just a giant chunk of c (c++) code that one must interact with in very specific ways
I don't understand this criticism: yes, data.table has an API.
Many practicing statisticians do actually care about easily using GPUs and doing distributed computations on distributed data sets with the same code they use for a local data set, which is what those Julia capabilities give you.
> The fastest data.frame library in ALL interpreted languages is consistently data.table, which is R.
DataFrames.jl is very rapidly catching up and starting to surpass it. After hitting a stable v1.0 they've begun focusing on performance and those benchmarks have changed significantly over the past three months. Here's the live view: https://h2oai.github.io/db-benchmark/
Oh I agree. What's convincing to me is the momentum. The DataFrames.jl team only started focusing on performance three months ago after hitting v1.0[1] and were able to rapidly become competitive with groupbys; the performance of join is next[2]. Compare the live view with the state when grandparent's blog post was written/updated (March of this year).
I expect it to continue to improve; note that it's starting to be the fastest implementation on some of the groupby benchmarks.
Many do; a university cluster is usually full since it runs three-day-long jobs from hundreds of people. But in order to switch I’d need to replace 100+ direct and indirect dependencies.
While writing in C is one way to speed up R code, you can also get pretty close to compiled speed by writing fully vectorized R code and pre-allocating vectors. The R REPL is just a thin wrapper over a bunch of C functions, and a careful programmer can ensure that allocation and copy operations (the slow bits) are kept to a minimum.
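To make that concrete, here is a minimal sketch of the three styles, assuming a toy task (squaring every element of a vector). The function names `grow`, `prealloc`, and `vectorized` are made up for illustration, and the timings will vary by machine, so none are quoted here.

```r
# Toy input; the size is arbitrary and kept small so the slow version finishes quickly.
n <- 2e4
x <- runif(n)

# Slow: grows the result one element at a time, so R has to
# allocate a new vector and copy the old one on every iteration.
grow <- function(x) {
  out <- c()
  for (v in x) out <- c(out, v^2)
  out
}

# Better: allocate the full result once, then fill it in place.
prealloc <- function(x) {
  out <- numeric(length(x))
  for (i in seq_along(x)) out[i] <- x[i]^2
  out
}

# Fully vectorized: one call, and the loop runs inside R's C code.
vectorized <- function(x) x^2

# All three produce the same result; only allocation and copying differ.
stopifnot(isTRUE(all.equal(grow(x), vectorized(x))))
stopifnot(isTRUE(all.equal(prealloc(x), vectorized(x))))

# Compare elapsed times on your own machine.
print(system.time(grow(x)))
print(system.time(prealloc(x)))
print(system.time(vectorized(x)))
```

The first version is the one that usually gives R its reputation for slow loops: the cost comes from repeated allocation and copying, not from the loop construct by itself.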
This is good if your code can be expressed in vectorized operations and doesn't gain benefits from problem structure exploited by multiple dispatch. With R, the best you can get is the speed of someone else's (or your own) C code, while Julia can beat C.
You’d be surprised how many people don’t know that a for loop isn’t great compared to vectorizing. The same goes for Julia: few will know that types have an impact on speed. My point is that you won’t automatically write faster code.
I'm wondering if most statisticians or researchers deal with data big enough that massively better performance would be enough motivation to switch.