by yihui
3510 days ago
Isn't this a simple instruction you give students in the very first class, like "before you submit your homework, restart the R session and make sure your submission runs in the new session"? This only requires them to click a menu item (Restart R session) and a button (Knit or Source or something). It's not really a burden for them, but it will save your life as the instructor.
As someone who was a student in statistics for more than 10 years, I confess I never wrote a single test for my homework. Frankly, I just didn't have the time or interest (there was too much homework, and becoming a professional software engineer was not the goal of the assignments). That said, when I put on my software engineer hat at work now, I'd definitely do what you advocate here and write tests carefully.
If you want your students to enjoy the benefits of both R packages and R Markdown, I wrote some thoughts here a couple of years ago: http://yihui.name/rlp/
Don't get me wrong. I'm all for teaching students good software engineering practices. I just want to speak from my own memories and experience as a student. Sometimes I feel teachers are like parents: they want kids to learn every possible right thing, whether or not the kids are practically able to swallow all the good stuff (sometimes this has bad psychological consequences, like rebellious children). If I were an instructor in statistics, I'd only require students to submit an R Markdown document. Other things like tests could earn extra credit but would not be required.
This would be a problem in a straight-line script where you're doing a bunch of data munging and analysis, since reloading from scratch might mean redoing expensive computations. But if you're building clean, reusable functions to implement interesting algorithms -- building a package and not a script -- then it's exactly the behavior you want.
Our course very much focuses on software engineering. Major topics include writing modular code, object-oriented design, thorough testing, and version control. We don't cover statistics concepts in the class -- it's computing for statisticians, not computational methods in statistics. We believe that teaching statisticians to compute like software engineers will, in the long term, dramatically improve their work, since they'll have a stable base of robust, modular, well-tested, reusable code.
One recent project, for example, required students to write a pipeline of scripts: one script takes the name of a CSV file as a command-line argument, processes and filters the data, and dumps it on STDOUT so the next script can read from STDIN and load the data into PostgreSQL, so another script (an R Markdown document) can do some queries and generate an automated report on the new batch of data. The processing and analysis stages have to be written as functions, not just top-level scripts, so they can be thoroughly tested.
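A minimal sketch of how the first stage might be split into a testable function plus a thin command-line wrapper. Everything specific here is hypothetical and not taken from the actual assignment: the names `filter_readings` and `run_stage`, the `value` column, and the rule of dropping missing or negative rows.

```r
# Hypothetical first pipeline stage: read a CSV named on the command line,
# filter it, and write the result to STDOUT for the next stage to read.

# Pure function with no I/O, so a unit test can call it directly.
# Assumed rule (illustrative only): keep rows whose `value` is present
# and non-negative.
filter_readings <- function(df) {
  keep <- !is.na(df$value) & df$value >= 0
  df[keep, , drop = FALSE]
}

# Thin I/O wrapper. `out` defaults to STDOUT but can be a file path,
# which makes the wrapper itself easy to exercise in a test.
run_stage <- function(path, out = stdout()) {
  raw <- read.csv(path, stringsAsFactors = FALSE)
  write.csv(filter_readings(raw), file = out, row.names = FALSE)
}

# Run only when executed directly with exactly one argument. When the file
# is source()d from a test, sys.nframe() is greater than zero, so nothing runs.
args <- commandArgs(trailingOnly = TRUE)
if (sys.nframe() == 0L && length(args) == 1L) {
  run_stage(args[1])
}
```

Saved as, say, `filter_stage.R`, this could be run as `Rscript filter_stage.R batch.csv` with its output piped into the next stage, while a test file can `source()` it and call `filter_readings()` on a small hand-built data frame without touching the file system.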
A future project will involve using dual k-d trees for fast approximate kernel density estimation, or building R-trees to efficiently query spatial data. These are definitely more like packages than scripts.