by throw20102010 2543 days ago
I think he’s exaggerating the effect of some of these claims. People who “grew up” learning R+Tidyverse will discriminate against non-Tidyverse users? Give me a break. You are either an advanced or competent or novice R user. You either understand how pipes (%>%) work or you don’t.

I doubt that anyone is going to be denied a job because they are an amazing R programmer but they just don’t have the experience with a particular set of packages, especially those that are as well implemented and easy to learn as the Tidyverse.

I can totally imagine someone not getting a job because they are a crap programmer, and then blaming it on something else. I can also imagine someone saying that the Tidyverse packages suck, and then being denied a job because of their attitude.

I could make a similar argument for most of the substantive claims in his post.

Some of these examples (dplyr vs. data.table) are cherry picked. I have several of my own examples where read_csv is way faster than read.csv, so maybe, like all good programmers, we should be testing and profiling our code and implementing the parts that make the most sense for our needs.
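For anyone who wants to test this on their own machine rather than argue from anecdotes, here is a minimal timing sketch. The synthetic data frame, its size, and the temp file are assumptions made up for illustration; they are not from the article, and the result will depend on your data and hardware.

```r
# Minimal timing sketch: base read.csv vs. readr::read_csv on a synthetic file.
# The data frame below is made up for illustration; point `path` at your own file.
library(readr)

set.seed(1)
n <- 200000
df <- data.frame(
  id    = seq_len(n),
  group = sample(letters, n, replace = TRUE),
  value = rnorm(n)
)

path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE)

# system.time() measures a single run; repeat the runs (or use a benchmarking
# package) before drawing any conclusions.
t_base  <- system.time(base_df  <- read.csv(path))
t_readr <- system.time(readr_df <- suppressMessages(read_csv(path)))

# Rows: reader; columns: user, system, and elapsed seconds.
print(rbind(read.csv = t_base[1:3], read_csv = t_readr[1:3]))

unlink(path)
```

A single timing like this only says something about one file shape, so treat it as a starting point for profiling, not a verdict on either package.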

The bottom line is that the Tidyverse is a good set of packages and you are free to use them or not. There isn’t some blood feud (like vi vs. emacs) between users and non-users; we all get along just fine.

There are dozens of great tutorials on how to learn R that don’t use the Tidyverse, and RStudio is under no obligation to offer a full course on every possible way to learn R. Any R user is also a competent Google user.

It’s fine to be opinionated. But a professor as respected as Norm Matloff should be careful about how they say things, or they risk souring their students on a set of packages that might be very useful in the future.

2 comments

"cherry picked"? There is no doubt that data.table is generally way faster than dplyr, unless you cherry pick a few use cases.

The problem is that dplyr is much slower than data.table, and RStudio is promoting the tidyverse so heavily that the slower choice becomes the default for many users.

The article is 100% correct on this issue.

Depending on the size of your data, you might not care that dplyr is slower than data.table. If you're better at writing/composing dplyr, you can often make up the speed difference between the two in terms of the savings in time spent writing and reading code. And if your data is that large, there are solutions like dbplyr out there to run dplyr code on various backends and offload the computation outside of R.
In practice, if there's ever a case that there's "too much" data such that dplyr starts to hang (e.g. millions of rows, hundreds of columns), you would get better value by setting up a database first with the data. Which you can then query with dbplyr!
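As a rough sketch of what querying a database with dbplyr looks like: the in-memory SQLite database and the `sales` table below are hypothetical stand-ins for a real database, and this assumes the dplyr, dbplyr, DBI, and RSQLite packages are installed.

```r
# Sketch: run ordinary dplyr verbs against a database table via dbplyr.
# The in-memory SQLite database and the `sales` table are made up for illustration.
library(dplyr)
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "sales", data.frame(
  region = c("east", "west", "east", "west", "east"),
  amount = c(10, 20, 30, 40, 50)
))

sales_db <- tbl(con, "sales")   # lazy reference; no rows are pulled into R yet

by_region <- sales_db %>%
  group_by(region) %>%
  summarise(total = sum(amount, na.rm = TRUE), n = n()) %>%
  arrange(desc(total))

show_query(by_region)           # prints the SQL that dbplyr generated
result <- collect(by_region)    # runs the query in the database, returns a tibble

dbDisconnect(con)
```

The aggregation runs inside the database; only the small summarised result comes back into R when `collect()` is called.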
data.table can handle millions of rows easily, as long as the data fits in memory.
Yeah, I'm surprised we're having performance arguments about these two libraries with mostly undefined performance characteristics which both run on a single-threaded runtime.
data.table does multithreading for a number of common operations. Running in parallel across processes (rather than threads) is also quite well supported.
data.table’s OpenMP stuff is pretty haphazard, and can’t parallelise anything that calls back into R code. And anything beyond that which involves lots of forking has just been painful every time I’ve seen it, and again, way slower than doing it on a more performant platform up front.
> If you're better at writing/composing dplyr, you can often make up the speed difference between the two in terms of the savings in time spent writing and reading code.

dplyr syntax is definitely more concise and readable than base R, but compared to data.table I don't think it has any advantage in terms of saving time writing or reading code.

I think the article sort of punts on providing examples of a complicated set of operations on a data frame. dplyr's author provides what I think is a good example of the differences between data.table and dplyr on a reasonably complex problem:

https://stackoverflow.com/questions/21435339/data-table-vs-d...

I feel like the first example is far more readable than the second. People can disagree on this, but the adoption rates of dplyr versus data.table do suggest (don't prove, but suggest) that the consensus on the issue leans towards dplyr. As we've noted, people certainly aren't adopting dplyr for the speed.

The second example should be written like this:

  diamondsDT[cut != "Fair", .(AvgPrice = mean(price),
                              MedianPrice = as.numeric(median(price)),
                              Count = .N), cut][order(-Count)]
There is no need to break it into 10 lines.
I honestly find that less readable than Hadley's version, especially turning "by = cut" into "cut". This is where terseness really cuts into readability (positional arguments are one of my least favorite features of R when it comes to long-term readability of code).
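For comparison, here is the same data.table query with the grouping argument named. This is a sketch that assumes `diamondsDT` is the ggplot2 `diamonds` dataset converted to a data.table, which the snippet above does not actually show.

```r
# Same query as the compact version above, but with `by = cut` spelled out.
# Assumption: diamondsDT is ggplot2's `diamonds` data as a data.table.
library(data.table)

diamondsDT <- as.data.table(ggplot2::diamonds)

res <- diamondsDT[
  cut != "Fair",                               # i: filter rows
  .(AvgPrice    = mean(price),                 # j: summaries per group
    MedianPrice = as.numeric(median(price)),
    Count       = .N),
  by = cut                                     # named instead of positional
][order(-Count)]                               # chain: sort by descending count

print(res)
```

Naming `by` costs a few characters and makes the three parts of the call (filter, summarise, group) easier to tell apart; whether that beats the piped dplyr version is still a matter of taste.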
It’s right on this issue, but the Tidyverse is a collection of packages, and most are speedy. Discussing the rare case that supports an argument but neglecting the numerous others that do not support it is the definition of cherry picking.

And I don’t expect RStudio to say, “we have this collection, which works fine for most users, except in this case you should replace package x with y, or in this corner case you might like package z.”

RStudio doesn’t need to promote a fragmented ecosystem if they don’t want to, it won’t cause the death of R.

See my other comment, but I would be curious to know how many users actually could detect the speed difference between a data.table and a tidy solution. Speaks to how small most datasets really are IMO
He was careful how he said it; thus the many disclaimers about being a fan of the tidyverse in general. I think you are underestimating the effect of the tidyverse being taught. Also, RStudio is absolutely free to push whatever packages they want, just as R users in general are free to use any package they want. But if most people learn the tidyverse, that's invariably what they will default to, whether people like it or not.
Sure, he puts the required disclaimers about respect in the beginning so that way he doesn't get accused of being hateful. But then he accuses RStudio of doing "an end run around the core R leadership team," and that what they are doing is "bad for the health of the project," which are pretty strong words.

RStudio isn't on the leadership team. If the language ends up dying (which it won't, R is dug into its place like a tick), it would be due to the leadership team's refusal to adapt to good ideas coming in from the community.