Hacker News
by mr_overalls 1798 days ago
Julia seems like such a superior language compared to R. What would be required for it to supplant R for statistical work (or some subset of it)?
4 comments

As an example, Douglas Bates, the author of R's excellent lme4 package for generalized linear mixed-effects models, has switched to Julia to develop MixedModels.jl. The Julia version is already excellent, and has many improvements over lme4.
“Only” writing a very large number of high-quality statistical and plotting packages…
Right. R's killer feature is its ecosystem.

I'm wondering if most statisticians or researchers deal with data big enough that massively better performance would be enough motivation to switch.

"Massively better performance" is a bit misleading: Julia is only massively better at certain workflows. The fastest data.frame library in ALL interpreted languages is consistently data.table, which is R. For in-memory data analysis, Julia will have to offer more than performance to win over statisticians/researchers.

Benchmarks: https://www.ritchievink.com/blog/2021/02/28/i-wrote-one-of-t...

As another commenter pointed out, DataFrames.jl is already faster than data.table in some benchmarks.

And that's the killer feature of Julia. It is easier to micro-optimize Julia code than code in any other language, static or dynamic. Meaning that if Julia is not best-in-class at a certain algorithm, it soon will be.

In addition to the comment about DataFrames.jl catching up, the two aren't comparable at all.

Julia's DF library is generic and allows user defined ops and types. You can put in GPU vectors, distributed vectors, custom number types etc. Julia optimizes all this stuff.

data.table is just a giant chunk of C (C++) code that one must interact with in very specific ways

> Julia's DF library is generic and allows user defined ops and types. You can put in GPU vectors, distributed vectors, custom number types etc. Julia optimizes all this stuff.

These features aren't of interest to practicing statisticians, which the parent comment was talking about.

> data.table is just a giant chunk of C (C++) code that one must interact with in very specific ways

I don't understand this criticism: yes, data.table has an API.

> These features aren't of interest to practicing statisticians, which the parent comment was talking about.

It's pretty convenient for things like uncertainty propagation and data cleaning...all things statisticians should care about.
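
To make the uncertainty-propagation point concrete, here is a rough sketch, in Python rather than Julia so it runs without any packages. The `Measurement` type and `column_mean` function are made-up names for illustration, not any library's API, and the error formulas assume independent errors with first-order propagation. The idea is that generic column code keeps working when the column holds a user-defined number type.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """A value with a 1-sigma uncertainty (hypothetical illustrative type)."""
    value: float
    sigma: float

    def __add__(self, other):
        other = _lift(other)
        # Assumes independent errors, which add in quadrature.
        return Measurement(self.value + other.value,
                           math.hypot(self.sigma, other.sigma))

    __radd__ = __add__  # lets the built-in sum() start from the integer 0

    def __mul__(self, other):
        other = _lift(other)
        # First-order propagation for a product of independent quantities.
        sigma = math.hypot(other.value * self.sigma, self.value * other.sigma)
        return Measurement(self.value * other.value, sigma)

    __rmul__ = __mul__


def _lift(x):
    """Treat a plain number as a measurement with zero uncertainty."""
    return x if isinstance(x, Measurement) else Measurement(float(x), 0.0)


def column_mean(column):
    """Generic code: never mentions Measurement, works for floats too."""
    return sum(column) * (1.0 / len(column))


plain = column_mean([1.0, 3.0])
noisy = column_mean([Measurement(10.0, 0.3), Measurement(12.0, 0.4)])
print(plain, noisy)
```

The same `column_mean` handles both columns; the uncertainty is carried along by the element type rather than by the container.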

> I don't understand this criticism: yes, data.table has an API.

A relatively limited API, walled off from the rest of the language.

Many practicing statisticians do actually care about easily using GPUs and doing distributed computations on distributed data sets with the same code they use for a local data set, which is what those Julia capabilities give you.
> The fastest data.frame library in ALL interpreted languages is consistently data.table, which is R.

DataFrames.jl is very rapidly catching up and starting to surpass it. After hitting a stable v1.0 they've begun focusing on performance and those benchmarks have changed significantly over the past three months. Here's the live view: https://h2oai.github.io/db-benchmark/

40% slower in groupbys and 4x slower in joins isn’t convincing.
Oh I agree. What's convincing to me is the momentum. The DataFrames.jl team only started focusing on performance three months ago after hitting v1.0[1] and were able to rapidly become competitive with groupbys; the performance of join is next[2]. Compare the live view with the state when grandparent's blog post was written/updated (March of this year).

I expect it to continue to improve; note that it's starting to be the fastest implementation on some of the groupby benchmarks.

1. https://discourse.julialang.org/t/release-announcements-for-...

2. https://discourse.julialang.org/t/the-state-of-dataframes-jl...

This seems quite cherry-picked, as there are 3 different dataset sizes.

However yes, it does not beat all other packages tested in performance.

I would like to see this benchmark with much more modern hardware, especially for GPU-related tools as the 1080 Ti they used is 4 years old.
Many do; a university cluster is usually full because it runs three-day-long jobs from hundreds of people. But in order to switch I’d need to replace 100+ direct and indirect dependencies.
You can write fast software with R, you just need to know how. The same applies to Julia - not everyone knows how to develop high-performance code.
> You can write fast software with R, you just need to know how.

When the trick to writing fast R code is to rely on C as much as possible, that feels less compelling.

While writing in C is one way to speed up R code, you can also get pretty close to compiled speed by writing fully vectorized R code and pre-allocating vectors. The R REPL is just a thin wrapper over a bunch of C functions, and a careful programmer can ensure that allocation and copy operations (the slow bits) are kept to a minimum.
This is good if your code can be expressed in vectorized operations and doesn't benefit from problem structure exploited by multiple dispatch. With R the best you can get is the speed of someone else's (or your) C code, while Julia can beat C.
You’d be surprised how many people don’t know that a for loop isn’t great compared to vectorizing. The same goes for Julia: few will know that types have an impact on speed. My point is that you won’t automatically write faster code.
And clone/bribe Hadley Wickham :-) He is a tour de force of R.
Or more realistically, a caret/parsnip-like interface that lets you seamlessly use either R or Julia as a backend.
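
A minimal sketch of what such a backend-agnostic interface could look like, written in Python for neutrality. Everything here is hypothetical: the `LinearReg` class, the `ENGINES` registry, and the engine names are invented for illustration, and the two engines are plain-Python stand-ins for what would really be calls into R or Julia. Only the `set_engine` name loosely echoes parsnip.

```python
# Registry mapping an engine name to a fitting function (hypothetical design).
ENGINES = {}


def register_engine(name):
    def decorator(fn):
        ENGINES[name] = fn
        return fn
    return decorator


@register_engine("centered")  # stand-in for one backend
def _fit_centered(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


@register_engine("normal_equations")  # stand-in for another backend
def _fit_normal_equations(x, y):
    n, sx, sy = len(x), sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return (sy - slope * sx) / n, slope


class LinearReg:
    """Model specification; the engine is chosen separately from the model."""

    def __init__(self, engine="centered"):
        if engine not in ENGINES:
            raise ValueError(f"unknown engine: {engine}")
        self.engine = engine

    def set_engine(self, name):
        return LinearReg(name)

    def fit(self, x, y):
        self.intercept, self.slope = ENGINES[self.engine](x, y)
        return self

    def predict(self, x):
        return [self.intercept + self.slope * xi for xi in x]


x, y = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]
fits = {name: LinearReg().set_engine(name).fit(x, y) for name in ENGINES}
for name, model in fits.items():
    print(name, model.intercept, model.slope)
```

User code only ever touches `LinearReg`, so swapping the backend is a one-word change; the hard part in practice would be the R and Julia bridges that this sketch leaves out.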
The majority of researchers don’t care about language superiority. They’re concerned with different issues, and software tends to suffer from a “publish and forget” attitude. Convenience matters, and the R ecosystem is quite good.
As a scientist programmer, that has not been my experience. In my experience, science programming is characterized by having to implement a lot of stuff from the ground up yourself, because unlike web dev or containerization, it's unlikely there is any existing library for metagenomic analysis of modified RNA.

And here Julia is a complete Godsend, since it makes it a joy to implement things from the bottom up.

Sure, you also need a language that already has dataframe libraries, plotting, editor support et cetera, and Julia is lagging behind Python and R in these areas. But Julia's getting there, and at the end of the day, it's a relatively low number of packages that are must-haves.

> In my experience, science programming is characterized by having to implement a lot of stuff from the ground up yourself

It depends on the field; there are hundreds of biological publications each month that just use existing software. And if I’m developing a new tool for single-cell analysis, it’s going to be interoperable with either Seurat or Bioconductor tools.

Exactly. Almost all of it is bespoke implementations, sometimes of an algorithm that has just been invented and not yet applied to a real problem.
Julia needs better basic stuff like PCA and GLM.