| HN Mirror



	by mr_overalls 1798 days ago
	Julia seems like such a superior language compared to R. What would be required for it to supplant R for statistical work (or some subset of it)?

4 comments

Duller-Finite 1798 days ago

As an example, Douglas Bates, the author of R's excellent lme4 package for generalized linear mixed-effects models, has switched to Julia to develop MixedModels.jl. The Julia version is already excellent, and has many improvements over lme4.

lycopodiopsida 1798 days ago

“Only” to write a very large number of high-quality statistical and plotting packages…

mr_overalls 1798 days ago

Right. R's killer feature is its ecosystem.

I'm wondering if most statisticians or researchers deal with data big enough that massively better performance would be enough motivation to switch.

tylermw 1798 days ago

"Massively better performance" is a bit misleading: Julia is only massively better at certain workflows. The fastest data.frame library in ALL interpreted languages is consistently data.table, which is R. For in-memory data analysis, Julia will have to offer more than performance to win over statisticians/researchers.

Benchmarks: https://www.ritchievink.com/blog/2021/02/28/i-wrote-one-of-t...

snicker7 1798 days ago

As another commenter pointed out, DataFrames.jl is already faster than data.table in some benchmarks.

And that's the killer feature of Julia. It is easier to micro-optimize Julia code than code in any other language, static or dynamic. Meaning if Julia is not best-in-class at a certain algorithm, it soon will be.

amkkma 1798 days ago

In addition to the comment about df.jl catching up, they aren't comparable at all.

Julia's DF library is generic and allows user defined ops and types. You can put in GPU vectors, distributed vectors, custom number types etc. Julia optimizes all this stuff.

data.frame is just a giant chunk of c (c++) code that one must interact with in very specific ways

tylermw 1798 days ago

> Julia's DF library is generic and allows user defined ops and types. You can put in GPU vectors, distributed vectors, custom number types etc. Julia optimizes all this stuff.

These features aren't of interest to practicing statisticians, which the parent comment was talking about.

> data.frame is just a giant chunk of c (c++) code that one must interact with in very specific ways

I don't understand this criticism: yes, data.table has an API.

amkkma 1798 days ago

>These features aren't of interest to practicing statisticians, which the parent comment was talking about.

It's pretty convenient for things like uncertainty propagation and data cleaning...all things statisticians should care about.

>I don't understand this criticism: yes, data.table has an API

A relatively limited API, walled off from the rest of the language.
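
To make the uncertainty-propagation point concrete, here is a minimal sketch in Python (not Julia, and not taken from any package mentioned in this thread). The `Measurement` type, the `_as_measurement` helper and the `mean` function are all hypothetical names made up for illustration, and the error model assumes independent errors with first-order propagation. The idea is only that generic column code keeps working when the column holds a user-defined number type.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """Hypothetical number type: a value with a 1-sigma uncertainty."""
    value: float
    sigma: float

    def __add__(self, other):
        other = _as_measurement(other)
        # Assumption: independent errors, so uncertainties add in quadrature.
        return Measurement(self.value + other.value,
                           math.hypot(self.sigma, other.sigma))

    __radd__ = __add__

    def __mul__(self, other):
        other = _as_measurement(other)
        # First-order propagation for a product of independent quantities.
        sigma = math.hypot(other.value * self.sigma, self.value * other.sigma)
        return Measurement(self.value * other.value, sigma)

    __rmul__ = __mul__


def _as_measurement(x):
    """Treat plain numbers as exact (zero-uncertainty) measurements."""
    return x if isinstance(x, Measurement) else Measurement(float(x), 0.0)


def mean(column):
    """Generic mean: the same code serves floats and Measurement values."""
    return sum(column) * (1.0 / len(column))


plain = [1.0, 2.0, 3.0]
noisy = [Measurement(1.0, 0.1), Measurement(2.0, 0.1), Measurement(3.0, 0.1)]

print(mean(plain))
print(mean(noisy))
```

The `mean` function never mentions `Measurement`; it only relies on `+` and `*` being defined. That is the property being argued for above, although Python resolves it at run time with no optimization, so this says nothing about speed.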

StefanKarpinski 1798 days ago

Many practicing statisticians do actually care about easily using GPUs and doing distributed computations on distributed data sets with the same code they use for a local data set, which is what those Julia capabilities give you.

mbauman 1798 days ago

> The fastest data.frame library in ALL interpreted languages is consistently data.table, which is R.

DataFrames.jl is very rapidly catching up and starting to surpass it. After hitting a stable v1.0 they've begun focusing on performance and those benchmarks have changed significantly over the past three months. Here's the live view: https://h2oai.github.io/db-benchmark/

nojito 1798 days ago

40% slower in groupbys and 4x slower in joins isn’t convincing.

mbauman 1798 days ago

Oh I agree. What's convincing to me is the momentum. The DataFrames.jl team only started focusing on performance three months ago after hitting v1.0[1] and were able to rapidly become competitive with groupbys; the performance of join is next[2]. Compare the live view with the state when grandparent's blog post was written/updated (March of this year).

I expect it to continue to improve; note that it's starting to be the fastest implementation on some of the groupby benchmarks.

1. https://discourse.julialang.org/t/release-announcements-for-...

2. https://discourse.julialang.org/t/the-state-of-dataframes-jl...

freemint 1798 days ago

This seems quite cherry-picked, as there are 3 different dataset sizes.

However yes, it does not beat all other packages tested in performance.

wdroz 1798 days ago

I would like to see this benchmark with much more modern hardware, especially for GPU-related tools as the 1080 Ti they used is 4 years old.

f6v 1798 days ago

Many do: a university cluster is usually full, since it runs 3-day-long jobs from hundreds of people. But in order to switch I’d need to replace 100+ direct and indirect dependencies.

f6v 1798 days ago

You can write fast software with R, you just need to know how. The same applies to Julia - not everyone knows how to develop high-performance code.

spywaregorilla 1798 days ago

> You can write fast software with R, you just need to know how.

When the trick to writing fast R code is to rely on C as much as possible, that feels less compelling.

tylermw 1798 days ago

While writing in C is one way to speed up R code, you can also get pretty close to compiled speed by writing fully vectorized R code and pre-allocating vectors. The R REPL is just a thin wrapper over a bunch of C functions, and a careful programmer can ensure that allocation and copy operations (the slow bits) are kept to a minimum.
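
A rough illustration of the allocation point, written in Python rather than R, so treat it as an analogy and not as R code. All three function names are made up for this sketch. The first version rebuilds its result on every iteration, which is roughly what growing an R vector with `c(out, x)` inside a loop does; the second allocates once and fills in place; the third hands the whole loop to a built-in, which is the closest stdlib analogue of a vectorized call.

```python
from itertools import accumulate


def running_total_growing(xs):
    """Slow pattern: rebuild the result on every iteration."""
    out = []
    total = 0.0
    for x in xs:
        total += x
        # Allocates a new list and copies every earlier element: O(n^2) overall.
        out = out + [total]
    return out


def running_total_preallocated(xs):
    """Allocate the result once, then fill it in place: O(n) overall."""
    out = [0.0] * len(xs)
    total = 0.0
    for i, x in enumerate(xs):
        total += x
        out[i] = total
    return out


def running_total_builtin(xs):
    """Delegate the loop to a built-in (implemented in C in CPython)."""
    return list(accumulate(xs))


data = [0.5, 1.5, 2.0, 4.0]
print(running_total_growing(data))
print(running_total_preallocated(data))
print(running_total_builtin(data))
```

All three return the same values; only the amount of allocation and copying differs, and that difference grows with the length of the input.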

freemint 1798 days ago

This is good if your code can be expressed in vectorized operations and doesn't gain benefits from problem structure exploited by multiple dispatch. With R the best you can get is the speed of someone else's (or your) C code, while Julia can beat C.

f6v 1798 days ago

You’d be surprised how many people don’t know that a for loop isn’t great compared to vectorizing. The same for Julia, few will know that types have an impact on speed. My point is that you won’t automatically write faster code.

systemvoltage 1798 days ago

And clone/bribe Hadley Wickham :-) He is a tour de force of R.

amkkma 1798 days ago

It's already superior to R for data munging stuff, imo

https://twitter.com/evalparse/status/1416039770833096706

And https://github.com/JuliaPlots/AlgebraOfGraphics.jl >>> GoG

tfehring 1798 days ago

Or more realistically, a caret/parsnip-like interface that lets you seamlessly use either R or Julia as a backend.
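
A sketch of what such an interface could look like, in Python and with everything invented for illustration: the `LinearModel` class, the engine registry and both engine names are hypothetical, and the two engines are plain-Python stand-ins, not real R or Julia bridges. It only shows the shape of the idea, which is one model specification with a swappable backend behind it.

```python
from typing import Callable, Dict, List, Tuple

# An engine takes (x, y) and returns (intercept, slope).
Engine = Callable[[List[float], List[float]], Tuple[float, float]]
_ENGINES: Dict[str, Engine] = {}


def register_engine(name):
    """Register a fitting backend under a string name."""
    def decorator(fn):
        _ENGINES[name] = fn
        return fn
    return decorator


@register_engine("closed_form")
def _closed_form(x, y):
    # Ordinary least squares for one predictor, solved directly.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


@register_engine("gradient_descent")
def _gradient_descent(x, y, steps=5000, lr=0.05):
    # Same model, fitted iteratively by minimizing mean squared error.
    n = len(x)
    a, b = 0.0, 0.0
    for _ in range(steps):
        resid = [a + b * xi - yi for xi, yi in zip(x, y)]
        grad_a = 2.0 * sum(resid) / n
        grad_b = 2.0 * sum(r * xi for r, xi in zip(resid, x)) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b


class LinearModel:
    """Model specification that is independent of the fitting backend."""

    def __init__(self, engine="closed_form"):
        self.engine = engine
        self.intercept = None
        self.slope = None

    def fit(self, x, y):
        self.intercept, self.slope = _ENGINES[self.engine](x, y)
        return self

    def predict(self, x):
        return [self.intercept + self.slope * xi for xi in x]


x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
for name in ("closed_form", "gradient_descent"):
    model = LinearModel(engine=name).fit(x, y)
    print(name, model.predict([4.0]))
```

User code only ever touches `LinearModel`; swapping the engine string is the whole migration. A real R/Julia version would additionally need a language bridge inside each engine, which this sketch does not attempt.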

f6v 1798 days ago

The majority of researchers don’t care about language superiority. They’re concerned with different issues, and software tends to suffer from a “publish and forget” attitude. Convenience matters, and the R ecosystem is quite good.

jakobnissen 1798 days ago

As a scientist programmer, that has not been my experience. In my experience, science programming is characterized by having to implement a lot of stuff from the ground up yourself, because unlike web dev or containerization, it's unlikely there is any existing library for metagenomic analysis of modified RNA.

And here Julia is a complete Godsend, since it makes it a joy to implement things from the bottom up.

Sure, you also need a language that already has dataframe libraries, plotting, editor support et cetera, and Julia is lagging behind Python and R in these areas. But Julia's getting there, and at the end of the day, it's a relatively low number of packages that are must-haves.

f6v 1798 days ago

> In my experience, science programming is characterized by having to implement a lot of stuff from the ground up yourself

It depends on the field; there are hundreds of biological publications each month that just use existing software. And if I’m developing a new tool for single-cell analysis, it’s going to be interoperable with either Seurat or Bioconductor tools.

leephillips 1798 days ago

Exactly. Almost all of it is bespoke implementations, sometimes of an algorithm that has just been invented and not yet applied to a real problem.

xiaodai 1798 days ago

Need better basic stuff like PCA and GLM.
