by yihui 3510 days ago
Isn't this a simple instruction you give students in the very first class like "before you submit your homework, restart R session, and make sure your submission runs in the new session"? This only requires them to click a menu item (Restart R session), and a button (Knit or Source or something). Not really a burden for them, but will save your life as the instructor.

As someone who was a student in statistics for more than 10 years, I confess I never wrote a single test for my homework. Frankly I just didn't have the time or interest (too much homework, and becoming a professional software engineer was not the goal of the homework assignments). That said, when I put on my software engineer hat now at work, I'd definitely do what you advocate here and write tests carefully. If you want your students to enjoy the benefits of both R packages and R Markdown, I wrote some thoughts here a couple of years ago: http://yihui.name/rlp/

Don't get me wrong. I'm all for teaching students good software engineering practices. I just want to speak from my own memories and experience as a student. Sometimes I feel teachers are like parents: they want kids to learn all possible right things, no matter whether they are practically able to swallow all the good stuff (sometimes this has bad psychological consequences, like rebellious children). If I were an instructor in statistics, I'd only require students to submit an R Markdown document. Other things like tests could earn extra credit but would not be required.

2 comments

Well, it doesn't just require them to click a menu item and hit a button -- then they have to fix all the problems that arose because RStudio encouraged a hackish development style. There's a workaround, but RStudio still actively encourages your workspace to get out of sync with your script. Compare that to DrRacket, where the code lives in a window labeled "Definitions", and every time you reload the definitions, your workspace starts from scratch. You can't accidentally interact with deleted code.
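To make the out-of-sync problem concrete, here is a minimal sketch. It is in Python rather than R, on the assumption that the mechanism is the same in any long-lived interactive session. The names `workspace`, `script_v1`, `script_v2`, `clean` and `summarize` are made up for illustration; a plain dict stands in for the interactive workspace.

```python
# A long-lived dict stands in for the interactive workspace (hypothetical setup).
workspace = {}

script_v1 = """
def clean(x):
    return x.strip().lower()

def summarize(xs):
    return [clean(x) for x in xs]
"""

# Later the student deletes clean() from the file, forgetting summarize() uses it.
script_v2 = """
def summarize(xs):
    return [clean(x) for x in xs]
"""

exec(script_v1, workspace)  # first "Source" of the script
exec(script_v2, workspace)  # re-source after the edit, same session

# Still works, because the deleted clean() lingers in the old workspace.
stale_result = workspace["summarize"]([" A ", "b"])

# Running the same file in a fresh workspace exposes the bug immediately.
fresh = {}
exec(script_v2, fresh)
try:
    fresh["summarize"]([" A ", "b"])
    fresh_error = None
except NameError as err:
    fresh_error = err
```

The edited script "passes" in the old session and fails with a NameError in the new one, which is the situation a reload-from-scratch environment rules out.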

This would be a problem in a straight-line script where you're doing a bunch of data munging and analysis, since reloading from scratch might mean redoing expensive computations. But if you're building clean, reusable functions to implement interesting algorithms -- building a package and not a script -- then it's exactly the behavior you want.

Our course very much focuses on software engineering. Major topics include writing modular code, object-oriented design, thorough testing, and version control. We don't cover statistics concepts in the class -- it's computing for statisticians, not computational methods in statistics. We believe that teaching statisticians to compute like software engineers will, in the long term, dramatically improve their work, since they'll have a stable base of robust, modular, well-tested, reusable code.

One recent project, for example, required students to write a pipeline of scripts. One script takes the name of a CSV file as a command-line argument, processes and filters the data, and dumps it on STDOUT. The next script reads from STDIN and loads the data into PostgreSQL. A final script (an R Markdown document) runs some queries and generates an automated report on the new batch of data. The processing and analysis stages have to be written as functions, not just top-level scripts, so they can be thoroughly tested.
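As a rough illustration of "functions, not just top-level scripts", here is a sketch of what the first stage might look like. It is in Python rather than R, and it is not the actual assignment: the column name `value`, the threshold, and the function names `filter_rows`, `process` and `main` are all assumptions made for the example.

```python
import csv
import io
import sys


def filter_rows(rows, column, minimum):
    """Keep rows whose numeric `column` is at least `minimum`."""
    return [row for row in rows if float(row[column]) >= minimum]


def process(in_stream, out_stream, column="value", minimum=0.0):
    """Read CSV from in_stream, filter it, and write CSV to out_stream."""
    reader = csv.DictReader(in_stream)
    kept = filter_rows(reader, column, minimum)
    writer = csv.DictWriter(
        out_stream, fieldnames=reader.fieldnames, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(kept)
    return len(kept)


def main(argv):
    """Thin wrapper: CSV path from the command line, result on STDOUT."""
    with open(argv[1], newline="") as f:
        process(f, sys.stdout)


# Because the logic lives in functions, it can be exercised on in-memory
# data without touching the file system or a real pipe.
sample = io.StringIO("id,value\n1,-2.5\n2,3.0\n3,0.0\n")
out = io.StringIO()
n_kept = process(sample, out)
```

In a real script, `main(sys.argv)` would be called under the usual `if __name__ == "__main__":` guard. The point of the split is that `filter_rows` and `process` can be unit-tested without the command line.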

A future project will involve using dual k-d trees for fast approximate kernel density estimation, or building R trees to efficiently query spatial data. These are definitely more like packages than scripts.

It is not that we encourage a "hackish development style", but computer scientists and statisticians/data analysts are solving different problems, and statisticians' primary job is often not software development. There is not a single absolutely correct style for both groups. You should not expect statisticians to be professional software engineers, or vice versa. We can learn good practices from each other. Statisticians and data analysts often use the EDA approach (Exploratory Data Analysis), and it makes sense to "pollute" the workspace temporarily. Running everything from scratch feels like using punch cards, which is related to the history of S (which in turn inspired R). Statisticians at Bell Labs found it tedious to submit a program to a machine, wait for a day, get hundreds of pages of output the next day, read the output by eye, modify the program, and do it again. They wanted instant feedback (plots/summary tables) as they explored the data.

We take reproducibility very seriously. The fact that RStudio's Knit button uses a new R session, instead of the current R session, to compile R Markdown documents was a deliberate choice to make sure your output is produced from a clean R session. But if you are doing EDA, it may not be very pleasant to click this button over and over again every time you update your code (you can if you want).

If your course is focused on software engineering, everything you said makes perfect sense. Statisticians can learn the good principles in CS, but they are statisticians after all. There must be tradeoffs.

> Isn't this a simple instruction you give students in the very first class like "before you submit your homework, restart R session, and make sure your submission runs in the new session"? This only requires them to click a menu item (Restart R session), and a button (Knit or Source or something).

It's the nature of learners to make mistakes. The more things they have to remember to do, the less cognitive power they'll have to focus on what they're trying to learn.

That's a good point, though in this case the thing they need to remember is a key part of doing the job. Seeing a line of code run correctly once does not mean it's correct. It's one of those concepts that come up in many forms.

Perhaps one way to make a teachable moment of it is to help them set up a baby CI environment. Then every time it catches something the value of good practices is driven home.
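A "baby CI" could be as small as a script that runs the submission in a brand-new process and reports whether it succeeded. Here is a sketch under that assumption, in Python. The helper name `runs_clean` and the two throwaway scripts are invented for the demo.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def runs_clean(command, timeout=60):
    """Run `command` in a brand-new process; return (ok, stderr_text)."""
    result = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout
    )
    return result.returncode == 0, result.stderr


# Demo with two throwaway Python scripts so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.py"
    bad = Path(tmp) / "bad.py"
    good.write_text("x = 1\nprint(x + 1)\n")
    bad.write_text("print(y)\n")  # y was only ever defined interactively
    good_ok, _ = runs_clean([sys.executable, str(good)])
    bad_ok, bad_err = runs_clean([sys.executable, str(bad)])
```

For R homework the command would presumably be something like `["Rscript", "homework.R"]` (the file name is hypothetical). The new process starts with an empty workspace, so anything that only existed in the student's interactive session shows up as an error.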

That demand is the equivalent of pressing two buttons. (And yes, in my undergrad stats course I was also asked to restart R and rerun my code to make sure it was correct.)