| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by notafraudster 2546 days ago

I run a large scale national survey. We download the data from our survey platform. Survey respondents are asked 100+ questions. The questions change week to week and so the column names are not consistent. We exclude respondents who appear to be cheating the system (rushing through questions, straight-lining, skipping almost every question, etc.) As part of our completion check, we want to do a row-wise map of the data frame e.g. apply(respondents, 1, completion_function). For the sake of this post, let's say that the desired completion function is simply to return the number of NAs in the row -- sum(is.na(x)). I don't care about the name of the variables, I just want to treat each row as a vector one at a time, then perform the operation, and return it as a mutated variable.

When I wrote code to do this, the preferred tidyverse routes would either be to 1:nrow(df) %>% map(function(index) { row = df[index, ] }) or else df %>% pmap(function(.... laundry list of variables here)) { }. A brief Google shows this has gotten worse in the last few years as dplyr deprecated rowwise operations. I do see stack overflow posts of people writing their own tidy/pipe-friendly row-wise iteration functions, but nothing official.

Maybe I missed something. I am a reasonably competent R programmer and package author, but I don't live and breathe tidyverse data-wrangling the way some people do.

3 comments

wodenokoto 2545 days ago

You are right that tidy data is in a different form that many supplied tables are.

If I understand correctly, you want to know how many NA's there are in each column in a wide-form dataset (as opposed to a tidy dataset)

    # One line to make the data tidy.
    # The form of data will be 3 columns: id, question, answer, and no, we don't care what the columns are called, except for id.
    tidydf <- df %>% gather("question", "answer", -id) 

    # one line to do your check
    tidydf %>% group_by(id) %>% summarise(n_NA = sum(is.na(answer)))

Tidyverse is highly opinionated about its data structure, and it is one of its limiting factors, as it basically treats every dataset as a sparse dataset. This actually fits very well with your data, as a datapoint is not a fixed questionnaire, but rather a datapoint is a respondents answer to a question (as questionnaires vary in questions, a tall table layout is quite fitting).

From there on you have to think in groups and summaries, unless you wanna fight the library.

Tidyverse is an 80% datascience solution. It solves what you need 80% of the time really, really well, and the last 20% you either have to fall back to base R or really torture dplyr.

link

notafraudster 2542 days ago

I agree that an alternative way to do this would be to mutate an ID column (maybe row number), then gather, then summarize, except that this of course will throw away all the rest of the data, so not great if all you want to do is add a column. Hence, I normally map rows or use base R.

link

robust 2545 days ago

Thank you for providing this example! I share your experience that rowwise operations seem more difficult to program using the tidyverse than using apply.

link

cwyers 2546 days ago

map() in purrr is functionally equivalent to lapply(). If you can do something in lapply, you can do it with map.

link

notafraudster 2546 days ago

Right, which is why my example used its cousin, apply, instead of lapply. apply over the row margins of a data frame does not have an equivalent in tidyverse.

link

hadley 2545 days ago

Beware that apply() coerces data frames to matrices, which is time consuming and forces all columns to have the same type.

link