by notafraudster
2546 days ago
I run a large-scale national survey. We download the data from our survey platform. Survey respondents are asked 100+ questions. The questions change week to week, so the column names are not consistent. We exclude respondents who appear to be cheating the system (rushing through questions, straight-lining, skipping almost every question, etc.). As part of our completion check, we want to do a row-wise map of the data frame, e.g. apply(respondents, 1, completion_function). For the sake of this post, let's say that the desired completion function is simply to return the number of NAs in the row -- sum(is.na(x)). I don't care about the names of the variables; I just want to treat each row as a vector, one at a time, perform the operation, and return the result as a mutated variable. When I wrote code to do this, the preferred tidyverse routes were either 1:nrow(df) %>% map(function(index) { row = df[index, ] }) or else df %>% pmap(function(.... laundry list of variables here)) { }. A brief Google search suggests this has gotten worse in the last few years as dplyr deprecated rowwise operations. I do see Stack Overflow posts of people writing their own tidy/pipe-friendly row-wise iteration functions, but nothing official. Maybe I missed something. I am a reasonably competent R programmer and package author, but I don't live and breathe tidyverse data-wrangling the way some people do.
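For concreteness, here is a minimal base R sketch of the NA-counting case described above. The `respondents` toy data and the `n_missing` column name are made up for illustration; the real export has 100+ columns whose names change weekly, which is why nothing below refers to a column by name.

```r
# Hypothetical toy data standing in for the weekly survey export.
respondents <- data.frame(
  q1 = c(1, NA, 3),
  q2 = c(NA, NA, "b"),
  q3 = c(TRUE, NA, FALSE)
)

# General row-wise form: each row is handed to the function as one vector.
# Note that apply() first coerces the data frame to a matrix (character here,
# because the column types are mixed); NAs survive that coercion.
completion_function <- function(x) sum(is.na(x))
n_missing_apply <- apply(respondents, 1, completion_function)

# Vectorised shortcut for this particular completion function.
n_missing <- rowSums(is.na(respondents))

# Attach the result as a new column (the "mutated variable").
respondents$n_missing <- n_missing
```

The `apply()` form works for any function of a row, at the cost of the matrix coercion; `rowSums(is.na(...))` only covers the NA count but avoids iterating in R.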
If I understand correctly, you want to know how many NAs there are in each row of a wide-form dataset (as opposed to a tidy dataset).
Tidyverse is highly opinionated about its data structure, and that is one of its limiting factors, as it basically treats every dataset as a sparse dataset. This actually fits your data very well: a datapoint is not a fixed questionnaire, but rather a respondent's answer to a question (since questionnaires vary in their questions, a tall table layout is quite fitting). From there on you have to think in groups and summaries, unless you want to fight the library.
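A rough sketch of that tall layout, written in base R so it runs without extra packages. The `wide` data, the `respondent` id column, and all other names are hypothetical; in the tidyverse the same steps would presumably be `tidyr::pivot_longer()` followed by `group_by()` and `summarise()`.

```r
# Hypothetical wide export: one row per respondent, one column per question.
wide <- data.frame(
  respondent = c("r1", "r2", "r3"),
  q1 = c(1, NA, 3),
  q2 = c(NA, NA, "b"),
  q3 = c(TRUE, NA, FALSE)
)

# Everything except the id column is treated as a question,
# so the code does not depend on this week's column names.
question_cols <- setdiff(names(wide), "respondent")

# Tall layout: one row per (respondent, question) pair.
# Answers are coerced to character so mixed column types can share one column.
long <- data.frame(
  respondent = rep(wide$respondent, times = length(question_cols)),
  question   = rep(question_cols, each = nrow(wide)),
  answer     = unlist(lapply(wide[question_cols], as.character),
                      use.names = FALSE)
)

# "Group and summarise": count missing answers per respondent.
n_missing_by_respondent <- tapply(is.na(long$answer), long$respondent, sum)
```

In this shape the completion check becomes a grouped summary rather than a row-wise map, and a change in the set of questions only changes the number of rows per respondent.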
Tidyverse is an 80% data science solution. It solves what you need 80% of the time really, really well; for the last 20% you either have to fall back to base R or really torture dplyr.