by hoosieree 695 days ago
If you're lucky, someone releases code+data associated with their published paper. If you're really lucky, that code and data is in the same state as it was in the published paper. If you're really really lucky, someone besides the author can get it to run.

If you can consistently locate and run academic publication code without direct help from the authors, you are The Chosen One.

[edit]

In seriousness, reproducibility is also my biggest concern. Scientific/academic publishing could do a lot better than rendering pretty static documents - we can provide the data, code, version control, and build processes which produced the paper so anyone can reproduce what they see in the paper. AND we could host them together so they're bidirectionally linked, to facilitate other scientists building on top of our work.

That could be our future, with the right incentive structures in place.

1 comments

Isn't that the idea (or perhaps the promise) of languages like R or notebook tools like Jupyter or Colab, which provide a means to ingest, clean, analyse, and present your data, and then share the code you used to do that?
Notebooks aren't very git-friendly, so in practice you rarely know which version produced the paper.

The fact that you can run notebook cells out of order exacerbates this problem. Not only do you not know which version of the file was used, you also don't know in what order, or how many times, each cell within the file was executed to produce the plots you see in the paper.
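To make the out-of-order problem concrete, here is a rough sketch of a check you could run on a saved notebook. It relies on the `execution_count` field that the ipynb JSON format stores for each code cell; the function name `execution_order_issues` and the warning wording are my own invention, not part of any Jupyter tool. A clean result only says the saved counts run 1, 2, 3, ... from top to bottom. It can't tell you whether a cell was edited after it ran.

```python
import json


def execution_order_issues(notebook_json: str) -> list[str]:
    """Return warnings about the execution order recorded in an .ipynb document.

    Hypothetical helper: it only inspects the saved execution_count values.
    """
    nb = json.loads(notebook_json)
    issues = []
    previous = 0  # a fresh kernel numbers its first executed cell 1
    for index, cell in enumerate(nb.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue  # markdown and raw cells have no execution count
        count = cell.get("execution_count")
        if count is None:
            issues.append(f"cell {index}: never executed, or outputs were cleared")
            continue
        if count <= previous:
            issues.append(f"cell {index}: ran out of order ({count} after {previous})")
        elif count != previous + 1:
            issues.append(f"cell {index}: gap in counts ({previous} -> {count})")
        previous = count
    return issues
```

Usage would be something like `execution_order_issues(open("analysis.ipynb").read())`, where `analysis.ipynb` is a placeholder filename. An empty list is what you'd hope to see after "Restart Kernel and Run All".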

This isn't to discount the improvement in UX that you get from notebooks compared to my preferred alternative (emacs with org-mode). Maybe I'm just bitter that the ipynb format exists at all. If notebooks were just a UX layered on top of emacs+org-mode, that would fix most of the core issues.

I like notebooks; they are a useful tool. But they are just a slight adjustment to the programming model and an alternative type of IDE. They do not do much to help reproducibility. Versioning of data, software, and dependencies is much more important, as is verifying that the code actually runs on another machine and produces the correct results. Setting up CI for the project, plus basic end-to-end tests, is the minimum level I set for my own research (in applied machine learning).
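As one illustration of the data and dependency versioning point, here is a minimal sketch that writes down what a run actually used: the Python version, the installed package versions, and a SHA-256 hash of each input file. The names `sha256_of` and `build_manifest` and the manifest layout are assumptions for the example, not an established convention. A real project would probably pair this with a lock file and a CI job.

```python
import hashlib
import platform
from importlib import metadata
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large datasets don't need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_files: list[Path]) -> dict:
    """Collect interpreter, package, and data fingerprints for one run.

    Hypothetical layout: commit the result next to the paper's figures.
    """
    return {
        "python": platform.python_version(),
        "packages": sorted(
            f"{dist.metadata['Name']}=={dist.version}"
            for dist in metadata.distributions()
        ),
        "data": {str(path): sha256_of(path) for path in sorted(data_files)},
    }
```

An end-to-end test in CI could then rebuild the manifest on a clean machine and fail if the data hashes differ from the committed ones, before it compares any results.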