Hacker News
by educationdata 2546 days ago
"cherry picked"? There is no doubt that data.table is generally way faster than dplyr, unless you cherry pick a few use cases.

The problem is that dplyr is much slower than data.table, and RStudio promotes the tidyverse so heavily that the slower choice becomes the default for many users.

The article is 100% correct on this issue.

3 comments

Depending on the size of your data, you might not care that dplyr is slower than data.table. If you're better at writing/composing dplyr, you can often make up the speed difference between the two in terms of the savings in time spent writing and reading code. And if your data is that large, there are solutions like dbplyr out there to run dplyr code on various backends and offload the computation outside of R.
In practice, if there's ever a case that there's "too much" data such that dplyr starts to hang (e.g. millions of rows, hundreds of columns), you would get better value by setting up a database first with the data. Which you can then query with dbplyr!
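To make the dbplyr suggestion concrete, here is a minimal sketch of what that workflow can look like. It assumes the dplyr, dbplyr, DBI and RSQLite packages are installed; the in-memory SQLite database and the toy "diamonds" table are stand-ins for a real database, not anything from the article.

```r
library(dplyr)  # dbplyr must be installed too; dplyr uses it for database tables
library(DBI)

# Assumption: an in-memory SQLite database stands in for a real database server
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical toy table; in practice the data would already live in the database
dbWriteTable(con, "diamonds", data.frame(
  cut   = c("Fair", "Good", "Good", "Ideal", "Ideal", "Ideal"),
  price = c(300, 400, 600, 500, 700, 900)
))

# tbl() creates a lazy reference; the verbs below are translated to SQL
query <- tbl(con, "diamonds") %>%
  filter(cut != "Fair") %>%
  group_by(cut) %>%
  summarise(avg_price = mean(price, na.rm = TRUE), count = n())

show_query(query)         # inspect the generated SQL
result <- collect(query)  # run it in the database, pull the result into R

dbDisconnect(con)
```

Nothing is computed in R until collect() is called, so the heavy lifting stays in the database.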
data.table can handle millions of rows easily, as long as the data fits in memory.
Yeah, I'm surprised we're having performance arguments about these two libraries with mostly undefined performance characteristics which both run on a single-threaded runtime.
data.table does multithreading for a number of common operations. Running in parallel (not multithreaded) is quite well supported.
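For anyone who wants to check this on their own machine, a small sketch using data.table's thread controls. The thread count of 2 is an arbitrary choice for illustration, and the number actually used depends on the machine and on whether data.table was built with OpenMP.

```r
library(data.table)

getDTthreads(verbose = TRUE)  # report OpenMP availability and thread settings
old <- setDTthreads(2)        # request 2 threads; returns the previous setting
getDTthreads()                # threads data.table will actually use
setDTthreads(old)             # restore the previous setting
```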
data.table’s OpenMP stuff is pretty haphazard, and can’t parallelise anything that calls back into R code. Anything beyond that which relies on heavy forking has been painful every time I’ve seen it, and again, way slower than doing it on a more performant platform up front.
> If you're better at writing/composing dplyr, you can often make up the speed difference between the two in terms of the savings in time spent writing and reading code.

dplyr syntax is definitely more concise and readable than base R, but compared to data.table I don't think it has any advantage in terms of saving time writing or reading code.

I think the article sort of punts on providing examples of a complicated set of operations on a data frame. dplyr's author provides what I think is a good example of the differences between data.table and dplyr on a reasonably complex problem:

https://stackoverflow.com/questions/21435339/data-table-vs-d...

I feel like the first example is far more readable than the second. People can disagree on this, but the adoption rates of dplyr versus data.table do suggest (don't prove, but suggest) that the consensus on the issue leans towards dplyr. As we've noted, people certainly aren't adopting dplyr for the speed.

The second example should be written like this:

  diamondsDT[cut != "Fair", .(AvgPrice = mean(price),
                              MedianPrice = as.numeric(median(price)),
                              Count = .N), cut][order(-Count)]
There is no need to break it into 10 lines.
I honestly find that less readable than Hadley's version, especially with "by = cut" turned into a bare "cut". This is where terseness really cuts into readability (and positional arguments are one of my least favorite features of R in terms of long-term code readability).
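For comparison, a sketch of the same query with the grouping argument named explicitly. The small table here is a hypothetical stand-in for the diamonds data so the snippet runs on its own; only data.table is assumed to be installed.

```r
library(data.table)

# Hypothetical stand-in for the diamonds data discussed in the thread
diamondsDT <- data.table(
  cut   = c("Fair", "Good", "Good", "Ideal", "Ideal", "Ideal"),
  price = c(300, 400, 600, 500, 700, 900)
)

summaryDT <- diamondsDT[
  cut != "Fair",                              # i: filter rows
  .(AvgPrice    = mean(price),                # j: per-group summaries
    MedianPrice = as.numeric(median(price)),
    Count       = .N),
  by = cut                                    # grouping, named rather than positional
][order(-Count)]                              # chain: sort by descending count

print(summaryDT)
```

The result is the same as with the positional form; the only difference is that the third argument is spelled out as "by".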
It’s right on this issue, but the Tidyverse is a collection of packages, and most are speedy. Discussing the rare case that supports an argument but neglecting the numerous others that do not support it is the definition of cherry picking.

And I don’t expect RStudio to say, “we have this collection, which works fine for most users, except in this case you should replace package x with y, or in this corner case you might like package z.”

RStudio doesn’t need to promote a fragmented ecosystem if they don’t want to, it won’t cause the death of R.

See my other comment, but I would be curious to know how many users could actually detect the speed difference between a data.table and a tidy solution. Speaks to how small most datasets really are, IMO.