by etal 5037 days ago
We should be clear which of two kinds of scientific code we're talking about:

1. A program that implements a new technique which forms an important part of a research project. Maybe a program that is the research project, which will be described in a paper.

No doubt this code should be included with the publication, no matter how "ugly" it is. Some journals, e.g. Bioinformatics, already require that an article about software must include the software itself. This is the stuff the Bioinformatics Testing Consortium would run a smoke test on, because amazingly, a lot of programs that have been written up as journal articles just don't compile or work at all on somebody else's machine; many articles don't include the source code, and some don't even say how to get a redistributable binary. That's wrong, and we can fix it.

2. The mountain of single-use scripts and shell commands that are used in a research project that's not really about software at all, only a small fraction of which produce some output that the scientist follows up on.

Key points: (1) this code is very unlikely to work on anyone else's machine as-is; (2) crucial parts of these pipelines are lost in the Bash history, or were executed on a 3rd-party web server, or depend on a data set on loan from a collaborator who is not ready to release the data yet; (3) almost all of the code is dead; (4) whatever comments or notes exist are usually misleading or completely wrong.

As an example of what can go wrong when this code is released as-is, remember when the East Anglia Climate Research Unit "hide the decline" stuff hit the fan? It wasn't clear which code was dead, the comments made no sense, and people freaked because they couldn't be sure how the published results came out of that godawful mess. The eventual solution, way too late, was to make a proper open-source, openly developed software project out of the important bits. That, in a nutshell, is why scientists won't release ALL the code -- even the hard drive itself is not the whole story; the scientist still needs to be available to explain it and navigate over the red herrings. And getting code into a state where it's self-explanatory takes time.

> That, in a nutshell, is why scientists won't release ALL the code -- even the hard drive itself is not the whole story; the scientist still needs to be available to explain it and navigate over the red herrings.

If said scientist can't do that, how does anyone know what was actually run?

That's why we write papers. Plain English can be more coherent than a pile of code.

> That's why we write papers. Plain English can be more coherent than a pile of code.

"Plain English" doesn't analyze data; software does.

If the software is a mess, how likely is it that the "plain English" description is correct? How do you know? Why should anyone believe that the description is correct?

Code is truth.

Right, which is why the novel parts should get more attention and undergo code review, which is the goal of the Bioinformatics Testing Consortium.

To be clear, I'm all for open science and even open notebooks where it's a good fit for the project. I just don't think a pile of single-use scripts is a sufficient replacement for a clear English description of the analysis workflow and the reasons for each step. If I can't understand how an analysis was done from the article itself and the documentation for any associated software, I would not trust the article. Including more code, particularly the code further down the Pareto curve of relevance to the final article, does not make the article more correct -- most journal articles are wrong or flawed in some way, even if the code works as advertised.