|
We should be clear which of two kinds of scientific code we're talking about:

1. A program that implements a new technique which forms an important part of a research project. Maybe a program that is the research project, which will be described in a paper. No doubt this code should be included with the publication, no matter how "ugly" it is. Some journals, e.g. Bioinformatics, already require that an article about software must include the software itself. This is the stuff the Bioinformatics Testing Consortium would run a smoke test on, because amazingly, a lot of programs that have been written up as journal articles just don't compile or work at all on somebody else's machine; many articles don't include the source code, and some don't even say how to get a redistributable binary. That's wrong, and we can fix it.

2. The mountain of single-use scripts and shell commands that are used in a research project that's not really about software at all, only a small fraction of which produce some output that the scientist follows up on. Key points: (1) this code is very unlikely to work on anyone else's machine as-is; (2) crucial parts of these pipelines are lost in the Bash history, or were executed on a 3rd-party web server, or depend on a data set on loan from a collaborator who is not ready to release the data yet; (3) almost all of the code is dead; (4) whatever comments or notes exist are usually misleading or completely wrong.

As an example of what can go wrong when the second kind of code is released as-is, remember when the "hide the decline" stuff from the University of East Anglia's Climatic Research Unit hit the fan? It wasn't clear which code was dead, the comments made no sense, and people freaked because they couldn't be sure how the published results came out of that godawful mess. The eventual solution, way too late, was to make a proper open-source, openly developed software project out of the important bits.
That, in a nutshell, is why scientists won't release ALL the code. Even the hard drive itself is not the whole story; the scientist still needs to be available to explain it and steer readers around the red herrings. And getting code into a state where it's self-explanatory takes time. |
If said scientist can't do that, how does anyone know what was actually run?