by cwhittle 5225 days ago
Speaking as a scientist who deals with genomic data, I wholeheartedly agree with many of the comments here. Code and raw data should be available at publication. I shouldn't have to try and figure out what you did from three lines of text and the poorly documented software you mention (which has been updated several times since you used it, with no mention of the version). Personally, I think pseudo-code would be most useful for reproducibility and for illustrating exactly what your program does.

Let me add to a few points here about the practical obstacles to this.

1) Journals don't support this data (raw data or software).

* You can barely include the directly relevant data in your paper, let alone anything additional you might have done. Methods are fairly restricted and there is no format for supplemental data/methods. Unless your paper is about a tool, they don't want the details; they just want benchmarks. Yes, you can put it on your website, but websites change; there are so many broken links to data/software in even relatively new articles.

* As many people have said, lots of scientific processing is one-off type scripting. I need this value or format or transform, so I write a script to get that.

2) Science turns over as fast as or faster than the lifetimes of most development projects.

* A postdoc or grad student wrote something to deal with their dataset at the time. Both the person and the data have since moved on. The sequencing data has become higher resolution or changed chemistry and output, so it's all obsolete. The publication timeline of the linked article illustrates this. For just an editorial article it took 8 1/2 months from submission to publication. Now add the time it took to handle the data and write the paper prior to that and you're several years back. The languages and libraries that were used have all been through multiple updates, and your program only works with Python 2.6 and some library that is no longer maintained. Even data repositories such as GEO (http://www.ncbi.nlm.nih.gov/geo/) are constantly playing catch-up with the newest datatypes. Even their required descriptions of data-processing methodology are lacking.

3) Many scientists (and their journals and funding institutions, which drive most changes) don't respect the time or resources it takes to be better coders and release that data/code in a digestible format.

* Why should I make my little program accept dynamic input or properly version it with commentary if that work is just seen as a means to an end rather than as an integral part of the conclusions drawn? The current model of science encourages these problems. This last point might be specific to the biology-CS gap.
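
For what it's worth, "accept dynamic input and properly version" doesn't have to be much work. Here is a rough sketch in Python of what I mean, not anyone's actual pipeline: the one-off script takes its input path as an argument and prints a small record of what it ran on. The names `SCRIPT_VERSION`, `sha256_of`, `provenance` and the `--package` flag are all made up for illustration.

```python
import argparse
import hashlib
import json
import platform
import sys
from importlib import metadata

SCRIPT_VERSION = "0.1.0"  # hypothetical version string, bumped by hand


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance(input_path, packages=()):
    """Collect the details a reader would need to rerun the analysis."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "script_version": SCRIPT_VERSION,
        "python": platform.python_version(),
        "input": input_path,
        "input_sha256": sha256_of(input_path),
        "packages": versions,
    }


def main(argv=None):
    parser = argparse.ArgumentParser(
        description="One-off transform that also prints a provenance record."
    )
    parser.add_argument("input", help="path to the input data file")
    parser.add_argument(
        "--package",
        action="append",
        default=[],
        help="library whose installed version should be recorded (repeatable)",
    )
    args = parser.parse_args(argv)
    # The actual one-off transform would go here; this sketch only
    # writes the provenance record as JSON to standard output.
    json.dump(provenance(args.input, args.package), sys.stdout, indent=2)
    print()
```

Add the usual `if __name__ == "__main__": main()` line to run it from the shell, or call `main(["reads.fastq", "--package", "numpy"])` directly (the file name is just a placeholder). Pasting the JSON it prints into the supplement would already answer the "which version did you use" question.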