Hacker News new | ask | show | jobs
by cwyers 2547 days ago
Depending on the size of your data, you might not care that dplyr is slower than data.table. If you're better at writing/composing dplyr, you can often make up the speed difference between the two in terms of the savings in time spent writing and reading code. And if your data is that large, there are solutions like dbplyr out there to run dplyr code on various backends and offload the computation outside of R.
3 comments

In practice, if there's ever a case that there's "too much" data such that dplyr starts to hang (e.g. millions of rows, hundreds of columns), you would get better value by setting up a database first with the data. Which you can then query with dbplyr!
data.table can handle millions of rows easily, as long as the data can fit in the memory.
Yeah, I'm surprised we're having performance arguments about these two libraries with mostly undefined performance characteristics which both run on a single-threaded runtime.
data.table does multithreading for a number of common operations. Running in parallel (not multithreaded) is quite well supported.
data.table’s OpenMP stuff is pretty haphazard, and can’t parallelise anything that calls back into R code. And anything outside of this involving forking lots has just been painful every time I’ve seen it, and again, way slower than doing it on a more performant platform up front.
> If you're better at writing/composing dplyr, you can often make up the speed difference between the two in terms of the savings in time spent writing and reading code.

dplyr syntax is definitely more concise and readable than base R, but comparing to data.table I don't think it has any advantage in terms of saving time writing or reading code.

I think the article sort of punts on providing examples of a complicated set of operations on a data frame. dplyr's author provides what I think is a good example of the differences between data.table and dplyr on a reasonably complex problem:

https://stackoverflow.com/questions/21435339/data-table-vs-d...

I feel like the first example is far more readable than the second. People can disagree on this, but the adoption rates of dplyr versus data.table do suggest (don't prove, but suggest) that the consensus on the issue leans towards dplyr. As we've noted, people certainly aren't adopting dplyr for the speed.

The second example should be written like this:

  diamondsDT[cut != "Fair", .(AvgPrice = mean(price),
                              MedianPrice = as.numeric(median(price)),
                              Count = .N), cut][order(-Count)]
There is no need to break it to 10 lines.
I honestly find that less readable than Hadley's version, especially turning "by = cut" into "cut." This is where terseness really cuts into readability (and positional arguments is one of my least favorite features about R in terms of long-term readability of code).