by capnrefsmmat 3518 days ago
Well, it doesn't just require them to click a menu item and hit a button -- then they have to fix all the problems that arose because RStudio encouraged a hackish development style. There's a workaround, but RStudio still actively encourages your workspace to get out of sync with your script. Compare that to DrRacket, where the code pane is labeled the "Definitions" window, and every time you reload the definitions, your workspace starts from scratch. You can't accidentally interact with deleted code.

This would be a problem in a straight-line script where you're doing a bunch of data munging and analysis, since reloading from scratch might mean redoing expensive computations. But if you're building clean, reusable functions to implement interesting algorithms -- building a package and not a script -- then it's exactly the behavior you want.
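To make the out-of-sync workspace problem concrete, here is a minimal sketch in Python rather than R. The names `helper` and `analyze` are made up for illustration, and a plain dict stands in for the interactive workspace. It shows how re-running an edited script in a persistent workspace can keep working only because a deleted function is still lying around, while a from-scratch reload exposes the bug immediately.

```python
# First version of the "script": defines helper() and analyze().
script_v1 = """
def helper(x):
    return x * 2

def analyze(x):
    return helper(x) + 1
"""

# Edited version: helper() was deleted, but analyze() still calls it.
script_v2 = """
def analyze(x):
    return helper(x) + 1
"""

# Persistent workspace: re-running the edited script keeps the old helper().
workspace = {}
exec(script_v1, workspace)
exec(script_v2, workspace)
stale_result = workspace["analyze"](3)  # succeeds only because of leftover state

# Fresh workspace: the same edited script fails, which exposes the bug.
fresh = {}
exec(script_v2, fresh)
try:
    fresh["analyze"](3)
    fresh_error = None
except NameError as err:
    fresh_error = err

print("persistent workspace:", stale_result)
print("fresh workspace:", repr(fresh_error))
```

The second case is the behavior described above for DrRacket's reload: the definitions are the single source of truth, so code that was deleted cannot be called by accident.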

Our course very much focuses on software engineering. Major topics include writing modular code, object-oriented design, thorough testing, and version control. We don't cover statistics concepts in the class -- it's computing for statisticians, not computational methods in statistics. We believe that teaching statisticians to compute like software engineers will, in the long term, dramatically improve their work, since they'll have a stable base of robust, modular, well-tested, reusable code.

One recent project, for example, required students to write a pipeline of scripts: one script takes the name of a CSV file as a command-line argument, processes and filters the data, and dumps it on STDOUT; the next script reads from STDIN and loads the data into PostgreSQL; and a final script (an R Markdown document) runs some queries and generates an automated report on the new batch of data. The processing and analysis stages have to be written as functions, not just top-level scripts, so they can be thoroughly tested.
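As a rough illustration of the first stage of such a pipeline, here is a sketch in Python, not the course's actual assignment code. The column name `value`, the threshold of 0, and the function names are all assumptions made up for the example. The point is the shape: the filtering logic lives in a plain function that can be tested without touching files or STDOUT, and a thin `main` handles the command-line argument and the output stream.

```python
import csv
import sys


def filter_rows(rows, column, minimum):
    """Keep rows whose `column` parses as a number >= minimum."""
    kept = []
    for row in rows:
        try:
            value = float(row[column])
        except (KeyError, ValueError, TypeError):
            continue  # skip rows with missing or non-numeric values
        if value >= minimum:
            kept.append(row)
    return kept


def main(argv):
    """Read the CSV named in argv[1], filter it, write CSV to STDOUT."""
    if len(argv) != 2:
        print("usage: filter.py INPUT.csv", file=sys.stderr)
        return 2
    with open(argv[1], newline="") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames or []
        # "value" and 0.0 are hypothetical choices for this sketch.
        rows = filter_rows(reader, column="value", minimum=0.0)
    writer = csv.DictWriter(sys.stdout, fieldnames=fieldnames,
                            extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return 0


# In a real script, finish with the usual entry point:
#     if __name__ == "__main__":
#         sys.exit(main(sys.argv))
```

Because `filter_rows` takes any iterable of dicts, a unit test can pass it a small hand-written list instead of a file, which is what writing the stage as a function rather than a top-level script buys you.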

A future project will involve using dual k-d trees for fast approximate kernel density estimation, or building R-trees to efficiently query spatial data. These are definitely more like packages than scripts.

1 comment

It is not that we encourage a "hackish development style", but computer scientists and statisticians/data analysts are solving different problems, and statisticians' primary job is often not software development. There is not a single absolutely correct style for both groups. You should not expect statisticians to be professional software engineers, or vice versa. We can learn good practice from each other. Statisticians and data analysts often use the EDA approach (Exploratory Data Analysis), and it makes sense to "pollute" the workspace temporarily. Running everything from scratch feels like using punch cards, which is related to the history of S (which in turn inspired R). Statisticians at Bell Labs found it tedious to submit a program to a machine, wait for a day, get hundreds of pages of output the next day, read the output by eye, modify the program, and do it again. They wanted instant feedback (plots/summary tables) as they explored the data.

We take reproducibility very seriously. The fact that RStudio's Knit button uses a new R session, instead of the current R session, to compile R Markdown documents was a deliberate choice to make sure your output is produced from a clean R session. But if you are doing EDA, it may not be very pleasant to click this button over and over again every time you update your code (you can if you want).

If your course is focused on software engineering, everything you said makes perfect sense. Statisticians can learn the good principles in CS, but they are statisticians after all. There must be tradeoffs.