by hadley 4944 days ago
Would love to hear what you find most painful about data munging/wrangling and unit testing. It's something that I've been trying to improve in R (e.g. http://vita.had.co.nz/papers/tidy-data.html and http://journal.r-project.org/archive/2011-1/RJournal_2011-1_...)

Do you think the love it/hate it dichotomy over R for data 'munging' stems from different ways of thinking about data? I'm slowly getting comfortable in R since returning to work in a sort of freelance arrangement that makes me highly motivated to use free or affordable tools. I started out, however, in clinical epidemiology data analysis using MS Access and SAS. I still think of data in terms of rectangular data sets, RDBMS and SQL, and I have a hard time with vector- and matrix-related terminology.

I think I'm going to end up using reshape2 and data.table a lot, since sqldf is noticeably slower even with my data sets, which are small compared with those in web analytics, finance, etc. The problem with sqldf and variable names containing a dot is a real drag as I try to adopt good coding style. I do miss the clarity and familiarity of SQL statements, though, as I try to find my new workflow in R. I hope a more unified approach to data munging emerges soon.

BTW, I totally espouse the reproducible research (RR) method of documenting study design, analysis, interpretation... I am loving knitr and LaTeX for RR, so I can no longer imagine using different tools for data munging and analysis.
Correction: I should have said 'literate programming' instead of 'reproducible research', since I'm not in a position to follow all components of RR.