Hacker News
by mespe 2545 days ago
I teach and consult on R and data science. I had the privilege of learning R from one of R's core developers. My students often ask why I don't use the Tidyverse. The answer is that I don't need to - I can do everything the Tidyverse does and so much more in base R.

This article only briefly touches on what I think is the biggest issue with the Tidyverse. The Tidyverse is incredibly limiting. The "tidy" workflow hides many details of the language, which leads many users to think all data has to be in a tidy "data.frame" (or now tibble) and organized just so. The functions the Tidyverse authors have chosen to implement are seen as the limits of R's capabilities. Every data problem has to be forced into the Tidyverse box (or it is impossible). I cannot tell you how many Tidyverse scripts I have seen where 90% of the operations are munging the data into the correct format for the Tidyverse functions, when the original data can be handled with a single base R function (e.g. lapply). Most Tidyverse users accept this as just the way R is.

Most R users who only learn the Tidyverse never hit its limits, and for them the Tidyverse is perfectly OK. Those who do either resign themselves to the perceived limits of R, or (hopefully) start learning some base R. One friend who took the latter path once exclaimed to me "Logical subsetting is amazing!?!" - this is a foundational piece of the language. That she went 2+ years in the "Tidyverse" without even knowing it was an option was eye-opening to me.
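
For readers who have not met it, here is a minimal sketch of logical subsetting in base R. The `scores` data frame and its column names are made up for illustration.

```r
# Hypothetical toy data; names and values are invented for this example.
scores <- data.frame(
  name  = c("ana", "ben", "cy", "dee"),
  score = c(91, 58, NA, 77),
  stringsAsFactors = FALSE
)

# A logical vector marks the rows to keep (NA scores are excluded explicitly).
passed <- !is.na(scores$score) & scores$score >= 60

# Index rows with the logical vector; the empty column slot keeps all columns.
passing <- scores[passed, ]

# The same idea works on plain vectors; which() drops the NA comparison.
high <- scores$score[which(scores$score > 80)]
```

The condition is just an ordinary vector, so it can be built up, stored, and reused like any other value.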

To its credit, the Tidyverse is empowering - with very little programming knowledge a beginner can do a lot of data science. However, the majority of these people get stuck as "expert beginners". Some of these become fierce advocates of the Tidyverse without truly understanding base R. Meanwhile, none of the truly advanced R users I encounter use the Tidyverse.


2 comments

"Meanwhile, none of the truly advanced R users I encounter use the Tidyverse."

This is most likely because truly advanced R users have been using R since far before the Tidyverse existed. A whole new generation of R users is being brought up with the Tidyverse, so I'm curious to see how the situation will be in 10 years' time.

This is not entirely true - many are of the same "generation" as me. To be clear, I don't actually consider myself an advanced user. To me, many of these advanced users are pushing against the limits of the language in a way you don't see with run-of-the-mill data science.

That said, I have heard my mentor say "I don't get the point of <insert tidyverse function> - I did this 20 years ago."

> That said, I have heard my mentor say "I don't get the point of <insert tidyverse function> - I did this 20 years ago."

Tidyverse has a tendency to promote "new" things without any references to what came before. If you listen to their talks they speak as if they invented functional programming and the idea of pure and tiny functions working together.

And it trickles down to the users. A lot of people who learned tidyverse first, for example, praise the `purrr` package but have no idea that something like `Map()` is in base R.
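
As a rough illustration of the base R functional tools mentioned here, this sketch uses `Map()`, `Filter()`, and `Reduce()`. The `values` and `weights` lists are hypothetical; `Map()` covers roughly the ground that purrr users reach for `map2()` or `pmap()` to cover.

```r
# Hypothetical inputs: per-group vectors and a matching weight for each group.
values  <- list(a = c(1, 2, 3), b = c(10, 20))
weights <- list(a = 2, b = 0.5)

# Map() walks both lists in parallel and returns a named list.
scaled <- Map(function(v, w) v * w, values, weights)

# Filter() keeps the elements for which the predicate is TRUE.
big <- Filter(function(v) sum(v) > 13, scaled)

# Reduce() folds a list down to a single value.
total <- Reduce(`+`, lapply(scaled, sum))
```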

Can you provide an example of a problem that requires multiple tidyverse operations, but could be solved equally well using only lapply?

I run a large-scale national survey. We download the data from our survey platform. Survey respondents are asked 100+ questions. The questions change week to week and so the column names are not consistent. We exclude respondents who appear to be cheating the system (rushing through questions, straight-lining, skipping almost every question, etc.). As part of our completion check, we want to do a row-wise map of the data frame e.g. apply(respondents, 1, completion_function). For the sake of this post, let's say that the desired completion function is simply to return the number of NAs in the row -- sum(is.na(x)). I don't care about the name of the variables, I just want to treat each row as a vector one at a time, then perform the operation, and return it as a mutated variable.
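
A minimal base R sketch of the row-wise check described above. The toy `respondents` data frame is an assumption, with `q1` to `q3` standing in for the changing question columns.

```r
# Hypothetical survey extract; column names and values are invented.
respondents <- data.frame(
  q1 = c(1, NA, 3),
  q2 = c("yes", NA, NA),
  q3 = c(NA, NA, 2.5),
  stringsAsFactors = FALSE
)

# The completion function from the post: count missing values in one row.
completion_function <- function(x) sum(is.na(x))

# MARGIN = 1 hands each row to the function as a vector.
# apply() converts the data frame to a matrix first; NAs survive that step.
respondents$n_NA <- apply(respondents, 1, completion_function)
```

The column names never appear in the row-wise call, so the code does not care which questions were asked that week.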

When I wrote code to do this, the preferred tidyverse routes were either 1:nrow(df) %>% map(function(index) { row = df[index, ] }) or else df %>% pmap(function(... laundry list of variables here ...) { }). A brief Google shows this has gotten worse in the last few years as dplyr deprecated rowwise operations. I do see Stack Overflow posts of people writing their own tidy/pipe-friendly row-wise iteration functions, but nothing official.

Maybe I missed something. I am a reasonably competent R programmer and package author, but I don't live and breathe tidyverse data-wrangling the way some people do.

You are right that tidy data is in a different form than many supplied tables are.

If I understand correctly, you want to know how many NAs there are in each row (one row per respondent) of a wide-form dataset (as opposed to a tidy dataset).

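    library(dplyr)  # %>%, group_by(), summarise()
    library(tidyr)  # gather()

    # One line to make the data tidy (assumes df has a respondent column named id).
    # The form of data will be 3 columns: id, question, answer, and no, we don't care what the columns are called, except for id.
    tidydf <- df %>% gather("question", "answer", -id)

    # One line to do your check: count the missing answers per respondent.
    tidydf %>% group_by(id) %>% summarise(n_NA = sum(is.na(answer)))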

Tidyverse is highly opinionated about its data structure, and it is one of its limiting factors, as it basically treats every dataset as a sparse dataset. This actually fits very well with your data, as a datapoint is not a fixed questionnaire, but rather a datapoint is a respondent's answer to a question (as questionnaires vary in questions, a tall table layout is quite fitting).

From there on you have to think in groups and summaries, unless you wanna fight the library.

Tidyverse is an 80% data science solution. It solves what you need 80% of the time really, really well, and the last 20% you either have to fall back to base R or really torture dplyr.

I agree that an alternative way to do this would be to mutate an ID column (maybe row number), then gather, then summarize, except that this of course will throw away all the rest of the data, so not great if all you want to do is add a column. Hence, I normally map rows or use base R.
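
For the add-a-column case, here is a sketch in base R that keeps every original column. The `df` below and its `id`, `q1`, `q2` columns are hypothetical, and `rowSums()` is swapped in for `apply()` as a vectorised alternative.

```r
# Hypothetical wide-form data: one row per respondent, one column per question.
df <- data.frame(
  id = 1:3,
  q1 = c(1, NA, 3),
  q2 = c(NA, NA, "b"),
  stringsAsFactors = FALSE
)

# Everything except the id column counts as a question.
question_cols <- setdiff(names(df), "id")

# is.na() on a data frame returns a logical matrix; rowSums() counts the TRUEs.
df$n_NA <- rowSums(is.na(df[question_cols]))
```

The result is the original data frame with one extra column, so no join back to the wide table is needed.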

Thank you for providing this example! I share your experience that rowwise operations seem more difficult to program using the tidyverse than using apply.

map() in purrr is functionally equivalent to lapply(). If you can do something in lapply, you can do it with map.

Right, which is why my example used its cousin, apply, instead of lapply. apply over the row margins of a data frame does not have an equivalent in tidyverse.

Beware that apply() coerces data frames to matrices, which is time-consuming and forces all columns to have the same type.
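
A small sketch of that coercion, using a made-up `mixed` data frame with one integer, one numeric, and one character column:

```r
# Hypothetical data frame with three different column types.
mixed <- data.frame(
  id    = 1:2,
  score = c(0.5, 1.5),
  label = c("a", "b"),
  stringsAsFactors = FALSE
)

# apply() calls as.matrix() first, so every row arrives as a character vector.
row_types <- apply(mixed, 1, class)

# Iterating over the data frame as a list of columns keeps each column's type.
col_types <- vapply(mixed, class, character(1))

# One way to walk rows without coercion: index one row at a time.
rows <- lapply(seq_len(nrow(mixed)), function(i) mixed[i, ])
```

Each element of `rows` is still a one-row data frame, so `score` stays numeric inside it.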