by educationdata 2546 days ago

"cherry picked"? There is no doubt that data.table is generally way faster than dplyr, unless you cherry pick a few use cases.

The problem is that dplyr is much slower than data.table, and RStudio promotes the tidyverse so heavily that the slower choice becomes the default for many users.

The article is 100% correct on this issue.

3 comments

cwyers 2546 days ago

Depending on the size of your data, you might not care that dplyr is slower than data.table. If you're better at writing/composing dplyr, you can often make up the speed difference between the two in terms of the savings in time spent writing and reading code. And if your data is that large, there are solutions like dbplyr out there to run dplyr code on various backends and offload the computation outside of R.

minimaxir 2546 days ago

In practice, if there's ever a case that there's "too much" data such that dplyr starts to hang (e.g. millions of rows, hundreds of columns), you would get better value by setting up a database first with the data. Which you can then query with dbplyr!
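
For readers who have not seen the dbplyr workflow mentioned here, a minimal sketch follows. It assumes the DBI, RSQLite, dplyr and dbplyr packages are installed; the in-memory SQLite database and the `sales` table with its columns are hypothetical stand-ins for a real database.

```r
library(dplyr)
library(dbplyr)

# Hypothetical setup: an in-memory SQLite database standing in for a real one.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
sales <- data.frame(region = c("a", "a", "b"), amount = c(10, 20, 5))
copy_to(con, sales, "sales")

# Ordinary dplyr verbs; dbplyr translates them to SQL instead of running them in R.
remote <- tbl(con, "sales") %>%
  group_by(region) %>%
  summarise(total = sum(amount, na.rm = TRUE)) %>%
  arrange(desc(total))

show_query(remote)          # print the generated SQL
result <- collect(remote)   # run the query and pull the result into R
DBI::dbDisconnect(con)
```

Nothing is computed in R until `collect()` is called, so the aggregation itself runs inside the database.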

educationdata 2546 days ago

data.table can handle millions of rows easily, as long as the data fits in memory.

thom 2546 days ago

Yeah, I'm surprised we're having performance arguments about these two libraries with mostly undefined performance characteristics which both run on a single-threaded runtime.

Bootvis 2546 days ago

data.table does multithreading for a number of common operations. Running in parallel across processes (rather than threads) is also quite well supported.
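
As a rough illustration of the thread controls data.table exposes, here is a small sketch. The thread count of 2 and the toy table are assumptions, and which operations actually run in parallel depends on the data.table version and on whether it was built with OpenMP.

```r
library(data.table)

# Remember the current setting so it can be restored afterwards.
old_threads <- getDTthreads()

# Cap data.table at two threads (assumed value; without OpenMP it stays at 1).
setDTthreads(2)

# Toy table: one million rows spread evenly over 26 groups.
dt <- data.table(g = rep(letters, length.out = 1e6), x = runif(1e6))

# A grouped aggregation, written the same way regardless of thread count.
res <- dt[, .(mean_x = mean(x)), by = g]

# Restore the previous thread setting.
setDTthreads(old_threads)
```

The query itself does not change with the thread count; `setDTthreads()` only limits how many threads data.table's internal parallel regions may use.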

thom 2546 days ago

data.table’s OpenMP stuff is pretty haphazard, and can’t parallelise anything that calls back into R code. Anything beyond that which involves a lot of forking has been painful every time I’ve seen it, and again, way slower than doing it on a more performant platform up front.

educationdata 2546 days ago

> If you're better at writing/composing dplyr, you can often make up the speed difference between the two in terms of the savings in time spent writing and reading code.

dplyr syntax is definitely more concise and readable than base R, but compared to data.table I don't think it has any advantage in terms of saving time writing or reading code.

cwyers 2546 days ago

I think the article sort of punts on providing examples of a complicated set of operations on a data frame. dplyr's author provides what I think is a good example of the differences between data.table and dplyr on a reasonably complex problem:

https://stackoverflow.com/questions/21435339/data-table-vs-d...

I feel like the first example is far more readable than the second. People can disagree on this, but the adoption rates of dplyr versus data.table do suggest (don't prove, but suggest) that the consensus on the issue leans towards dplyr. As we've noted, people certainly aren't adopting dplyr for the speed.

educationdata 2546 days ago

The second example should be written like this:

  diamondsDT[cut != "Fair", .(AvgPrice = mean(price),
                              MedianPrice = as.numeric(median(price)),
                              Count = .N), cut][order(-Count)]

There is no need to break it into 10 lines.

cwyers 2546 days ago

I honestly find that less readable than Hadley's version, especially turning "by = cut" into "cut." This is where terseness really cuts into readability (and positional arguments are one of my least favorite features of R in terms of long-term readability of code).
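
For comparison, a sketch of the same query with the grouping argument named, which is the difference being argued about. The six-row `diamondsDT` table here is a hypothetical stand-in for the real diamonds data (which ships with ggplot2), so the numbers mean nothing.

```r
library(data.table)

# Hypothetical stand-in for the diamonds data; only the two columns used below.
diamondsDT <- data.table(
  cut   = c("Fair", "Good", "Good", "Ideal", "Ideal", "Ideal"),
  price = c(300L, 400L, 500L, 600L, 700L, 800L)
)

# Filter, aggregate per group, then sort by group size (largest first).
result <- diamondsDT[
  cut != "Fair",
  .(AvgPrice    = mean(price),
    MedianPrice = as.numeric(median(price)),  # keep the column type consistent across groups
    Count       = .N),
  by = cut                                    # named instead of positional
][order(-Count)]

print(result)
```

Both spellings do the same thing in data.table; writing `by = cut` only makes it explicit that the third argument is the grouping column.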

throw20102010 2546 days ago

It’s right on this issue, but the Tidyverse is a collection of packages, and most are speedy. Discussing the rare case that supports an argument but neglecting the numerous others that do not support it is the definition of cherry picking.

And I don’t expect RStudio to say, “we have this collection, which works fine for most users, except in this case you should replace package x with y, or in this corner case you might like package z.”

RStudio doesn’t need to promote a fragmented ecosystem if they don’t want to; it won’t cause the death of R.

kickout 2546 days ago

See my other comment, but I would be curious to know how many users could actually detect the speed difference between a data.table solution and a tidy one. It speaks to how small most datasets really are, IMO.
