As an example, Douglas Bates, the author of R's excellent lme4 package for generalized linear mixed-effects models, has switched to Julia to develop MixedModels.jl. The Julia version is already excellent, and has many improvements over lme4.
"Massively better performance" is a bit misleading: Julia is only massively better at certain workflows. The fastest data.frame library in ALL interpreted languages is consistently data.table, which is R. For in-memory data analysis, Julia will have to offer more than performance to win over statisticians/researchers.
As another commenter pointed out, DataFrames.jl is already faster than data.table in some benchmarks.
And that's the killer feature of Julia. It is easier to micro-optimize Julia code than code in any other language, static or dynamic. Meaning if Julia is not best-in-class at a certain algorithm, it soon will be.
In addition to the comment about DataFrames.jl catching up, the two libraries aren't comparable at all.
Julia's DF library is generic and allows user defined ops and types. You can put in GPU vectors, distributed vectors, custom number types etc. Julia optimizes all this stuff.
data.frame is just a giant chunk of C (C++) code that one must interact with in very specific ways.
> Julia's DF library is generic and allows user defined ops and types. You can put in GPU vectors, distributed vectors, custom number types etc. Julia optimizes all this stuff.
These features aren't of interest to practicing statisticians, who are the audience the parent comment was talking about.
> data.frame is just a giant chunk of C (C++) code that one must interact with in very specific ways.
I don't understand this criticism: yes, data.table has an API.
Many practicing statisticians do actually care about easily using GPUs and doing distributed computations on distributed data sets with the same code they use for a local data set, which is what those Julia capabilities give you.
> The fastest data.frame library in ALL interpreted languages is consistently data.table, which is R.
DataFrames.jl is very rapidly catching up and starting to surpass it. After hitting a stable v1.0 they've begun focusing on performance and those benchmarks have changed significantly over the past three months. Here's the live view: https://h2oai.github.io/db-benchmark/
Oh, I agree. What's convincing to me is the momentum. The DataFrames.jl team only started focusing on performance three months ago after hitting v1.0[1], and they were able to rapidly become competitive on groupbys; join performance is next[2]. Compare the live view with the state when grandparent's blog post was written/updated (March of this year).
I expect it to continue to improve; note that it's starting to be the fastest implementation on some of the groupby benchmarks.
Many do. A university cluster is usually full, since it runs three-day-long jobs from hundreds of people. But in order to switch I’d need to replace 100+ direct and indirect dependencies.
While writing in C is one way to speed up R code, you can also get pretty close to compiled speed by writing fully vectorized R code and pre-allocating vectors. The R REPL is just a thin wrapper over a bunch of C functions, and a careful programmer can ensure that allocation and copy operations (the slow bits) are kept to a minimum.
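To make the allocation point concrete, here is a rough sketch in Python rather than R (a swapped-in language, stdlib only). The function names are made up for illustration, and the closest R equivalents would be growing a vector with `c()` inside a loop versus allocating it once up front. It assumes CPython, where `map` and `operator.mul` are implemented in C.

```python
import operator


def squares_growing(n):
    """Grow the result by concatenation: each step allocates a new list
    and copies everything accumulated so far (quadratic work overall)."""
    out = []
    for i in range(n):
        out = out + [i * i]  # new allocation plus a full copy, every iteration
    return out


def squares_preallocated(n):
    """Allocate the result once, then fill it in place (linear work)."""
    out = [0] * n
    for i in range(n):
        out[i] = i * i
    return out


def squares_bulk(n):
    """Express the whole computation as one bulk call, so the per-element
    loop runs inside C-implemented builtins instead of Python bytecode."""
    return list(map(operator.mul, range(n), range(n)))
```

All three return the same list; only the number of allocations and copies differs, and that is the part a careful programmer gets to control.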
This is good if your code can be expressed in vectorized operations and doesn't gain benefits from problem structure exploited by multiple dispatch. With R, the best you can do is the speed of someone else's (or your) C code, while Julia can beat C.
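A small sketch of what "problem structure exploited by dispatch" can look like, again in Python as a stand-in. Note the caveats: `functools.singledispatch` dispatches on the first argument only, so this is single dispatch, not Julia's multiple dispatch, and the `Dense` and `Diagonal` classes and the `matvec` function are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from functools import singledispatch


@dataclass
class Dense:
    rows: list  # full matrix stored as a list of row lists


@dataclass
class Diagonal:
    diag: list  # only the diagonal entries are stored


@singledispatch
def matvec(a, x):
    """Matrix-vector product; the implementation is chosen by the type of `a`."""
    raise TypeError(f"unsupported matrix type: {type(a).__name__}")


@matvec.register
def _(a: Dense, x):
    # Generic path: touches every entry, O(n^2) multiplications.
    return [sum(r * v for r, v in zip(row, x)) for row in a.rows]


@matvec.register
def _(a: Diagonal, x):
    # Structured path: the zeros are never stored or multiplied, O(n).
    return [d * v for d, v in zip(a.diag, x)]
```

Calling `matvec(Diagonal([2, 3]), [1, 1])` and `matvec(Dense([[2, 0], [0, 3]]), [1, 1])` gives the same answer, but the diagonal version does far less work as the matrix grows. The caller's code does not change; only the type of the argument does.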
You’d be surprised how many people don’t know that a for loop isn’t great compared to vectorizing. The same goes for Julia: few will know that types have an impact on speed. My point is that you won’t automatically write faster code.
The majority of researchers don’t care about language superiority. They’re concerned with different issues, and software tends to suffer from a “publish and forget” attitude. Convenience matters, and the R ecosystem is quite good.
As a scientist programmer, that has not been my experience. In my experience, science programming is characterized by having to implement a lot of stuff from the ground up yourself, because unlike web dev or containerization, it's unlikely there is any existing library for metagenomic analysis of modified RNA.
And here Julia is a complete Godsend, since it makes it a joy to implement things from the bottom up.
Sure, you also need a language that already has dataframe libraries, plotting, editor support et cetera, and Julia is lagging behind Python and R in these areas. But Julia's getting there, and at the end of the day, it's a relatively low number of packages that are must-haves.
> In my experience, science programming is characterized by having to implement a lot of stuff from the ground up yourself
It depends on the field; there are hundreds of biological publications each month that just use existing software. And if I’m developing a new tool for single-cell analysis, it’s going to be interoperable with either Seurat or Bioconductor tools.