| HN Mirror


by wodenokoto 2544 days ago

You are right that tidy data is in a different form than many supplied tables are.

If I understand correctly, you want to know how many NAs there are in each row (per id) of a wide-form dataset (as opposed to a tidy dataset).

    library(dplyr)
    library(tidyr)

    # One line to make the data tidy.
    # The result has 3 columns: id, question, answer.
    # Only the name of the id column matters; the other two names are arbitrary.
    tidydf <- df %>% gather("question", "answer", -id)

    # One line to do your check: count the NAs per id.
    tidydf %>% group_by(id) %>% summarise(n_NA = sum(is.na(answer)))
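
To make this concrete, here is a small sketch on a made-up questionnaire table. The data frame `df` and its columns `q1` to `q3` are invented for illustration; only the `id` column is assumed from the snippet above. Grouping by `question` instead of `id` gives the count per original column rather than per respondent.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide-form survey data: one row per respondent, one column per question.
df <- data.frame(
  id = c(1, 2, 3),
  q1 = c("yes", NA, "no"),
  q2 = c(NA, NA, "yes"),
  q3 = c("no", "yes", "no"),
  stringsAsFactors = FALSE
)

# Wide -> tall: one row per (respondent, question) pair.
tidydf <- df %>% gather("question", "answer", -id)

# NAs per respondent, i.e. per row of the original wide table.
na_per_id <- tidydf %>%
  group_by(id) %>%
  summarise(n_NA = sum(is.na(answer)))

# NAs per question, i.e. per column of the original wide table.
na_per_question <- tidydf %>%
  group_by(question) %>%
  summarise(n_NA = sum(is.na(answer)))
```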

Tidyverse is highly opinionated about its data structure, and that is one of its limiting factors, as it basically treats every dataset as a sparse dataset. This actually fits your data very well: a datapoint is not a fixed questionnaire but a respondent's answer to a question, and since questionnaires vary in their questions, a tall table layout is quite fitting.
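
A rough sketch of why the tall layout suits questionnaires with differing questions. The two survey tables below are invented for illustration. Stacking them in wide form pads every question a survey did not ask with NA, while stacking them in tall form keeps only the question/answer pairs that actually exist.

```r
library(dplyr)
library(tidyr)

# Two hypothetical questionnaires that only partly share their questions.
survey_a <- data.frame(id = c(1, 2), q1 = c("yes", "no"), q2 = c("no", NA),
                       stringsAsFactors = FALSE)
survey_b <- data.frame(id = c(3, 4), q2 = c("yes", "yes"), q3 = c(NA, "no"),
                       stringsAsFactors = FALSE)

# Wide stacking: columns missing from one survey are filled with NA,
# so "not asked" and "not answered" become indistinguishable.
wide <- bind_rows(survey_a, survey_b)

# Tall stacking: one row per question that was actually put to a respondent,
# so an NA here always means "asked but not answered".
tall <- bind_rows(
  survey_a %>% gather("question", "answer", -id),
  survey_b %>% gather("question", "answer", -id)
)

n_na_wide <- sum(is.na(wide))
n_na_tall <- sum(is.na(tall$answer))
```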

From there on you have to think in groups and summaries, unless you wanna fight the library.

Tidyverse is an 80% data science solution. It solves what you need 80% of the time really, really well; for the last 20% you either have to fall back to base R or really torture dplyr.

1 comment

notafraudster 2541 days ago

I agree that an alternative way to do this would be to mutate an ID column (maybe the row number), then gather, then summarize. Of course, this throws away all the rest of the data, so it is not great if all you want to do is add a column. Hence, I normally map over rows or use base R.
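
The trade-off can be sketched like this. The data frame `df` is invented for illustration. `rowSums(is.na(...))` is one base R route, and joining the summary back onto the original table is one possible way to keep the other columns while staying in dplyr; neither is necessarily what either commenter had in mind.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide-form data without an id column.
df <- data.frame(
  q1 = c("yes", NA, "no"),
  q2 = c(NA, NA, "yes"),
  q3 = c("no", "yes", "no"),
  stringsAsFactors = FALSE
)

# Base R: count NAs across each row and attach the result as a new column.
base_result <- df
base_result$n_NA <- rowSums(is.na(df))

# dplyr/tidyr: add a row-number id, gather, summarise ...
with_id <- df %>% mutate(id = row_number())
na_counts <- with_id %>%
  gather("question", "answer", -id) %>%
  group_by(id) %>%
  summarise(n_NA = sum(is.na(answer)))

# ... then join back, because the summary alone drops the original columns.
tidy_result <- with_id %>% left_join(na_counts, by = "id")
```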
