| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by cwyers 2547 days ago
	Depending on the size of your data, you might not care that dplyr is slower than data.table. If you're better at writing/composing dplyr, you can often make up the speed difference between the two in terms of the savings in time spent writing and reading code. And if your data is that large, there are solutions like dbplyr out there to run dplyr code on various backends and offload the computation outside of R.

3 comments

minimaxir 2546 days ago

In practice, if there's ever a case that there's "too much" data such that dplyr starts to hang (e.g. millions of rows, hundreds of columns), you would get better value by setting up a database first with the data. Which you can then query with dbplyr!

link

educationdata 2546 days ago

data.table can handle millions of rows easily, as long as the data can fit in the memory.

link

thom 2547 days ago

Yeah, I'm surprised we're having performance arguments about these two libraries with mostly undefined performance characteristics which both run on a single-threaded runtime.

link

Bootvis 2547 days ago

data.table does multithreading for a number of common operations. Running in parallel (not multithreaded) is quite well supported.

link

thom 2546 days ago

data.table’s OpenMP stuff is pretty haphazard, and can’t parallelise anything that calls back into R code. And anything outside of this involving forking lots has just been painful every time I’ve seen it, and again, way slower than doing it on a more performant platform up front.

link

educationdata 2547 days ago

> If you're better at writing/composing dplyr, you can often make up the speed difference between the two in terms of the savings in time spent writing and reading code.

dplyr syntax is definitely more concise and readable than base R, but comparing to data.table I don't think it has any advantage in terms of saving time writing or reading code.

link

cwyers 2547 days ago

I think the article sort of punts on providing examples of a complicated set of operations on a data frame. dplyr's author provides what I think is a good example of the differences between data.table and dplyr on a reasonably complex problem:

https://stackoverflow.com/questions/21435339/data-table-vs-d...

I feel like the first example is far more readable than the second. People can disagree on this, but the adoption rates of dplyr versus data.table do suggest (don't prove, but suggest) that the consensus on the issue leans towards dplyr. As we've noted, people certainly aren't adopting dplyr for the speed.

link

educationdata 2547 days ago

The second example should be written like this:

  diamondsDT[cut != "Fair", .(AvgPrice = mean(price),
                              MedianPrice = as.numeric(median(price)),
                              Count = .N), cut][order(-Count)]

There is no need to break it to 10 lines.

link

cwyers 2546 days ago

I honestly find that less readable than Hadley's version, especially turning "by = cut" into "cut." This is where terseness really cuts into readability (and positional arguments is one of my least favorite features about R in terms of long-term readability of code).

link