by tomtranter 697 days ago
My biggest frustration as an academic was the reproducibility of the papers I was reading. The PDF is such a useless medium for information transfer, and the academic publishing industry is a complete racket: all the value is generated by the authors and reviewers, who work for free and (in most cases) have to pay to make their work freely accessible to the public. I would love to see this turn into the default way to publish papers.

If you're lucky, someone releases code+data associated with their published paper. If you're really lucky, that code and data are in the same state as they were in the published paper. If you're really really lucky, someone besides the author can get it to run.

If you can consistently locate and run academic publication code without direct help from the authors, you are The Chosen One.

[edit]

In seriousness, reproducibility is also my biggest concern. Scientific/academic publishing could do a lot better than rendering pretty static documents - we can provide the data, code, version control, and build processes which produced the paper so anyone can reproduce what they see in the paper. AND we could host them together so they're bidirectionally linked, to facilitate other scientists building on top of our work.

That could be our future, with the right incentive structures in place.

Isn't that the idea (or perhaps the promise) of languages like R or notebook tools like Jupyter or Colab, which provide a means to ingest, clean, analyse and present your data, then share the code you've used to do that?
Notebooks aren't very git-friendly, so in practice you rarely know which version produced the paper.

The fact that you can run notebook cells out of order exacerbates this problem. Not only do you not know which version of the file was used, you also don't know in what order, or how many times, each cell within the file was executed to produce the plots you see in the paper.
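To make the out-of-order point concrete, here is a minimal sketch of a check you could run on a saved notebook. It assumes the standard ipynb JSON layout, where each code cell carries an `execution_count`, and it treats anything other than 1, 2, 3, ... from top to bottom as suspicious. The function name `execution_order_problems` and the sample notebook are made up for illustration. A clean sequence only shows that the saved state is consistent with one fresh top-to-bottom run; it does not prove that this is the version that produced the paper.

```python
import json


def _is_empty(cell: dict) -> bool:
    """Empty code cells are not executed, so they get no execution count."""
    source = cell.get("source", "")
    if isinstance(source, list):  # ipynb may store source as a list of lines
        source = "".join(source)
    return not source.strip()


def execution_order_problems(notebook: dict) -> list:
    """Return a description of every code cell whose execution_count
    does not match a single fresh run from the first cell to the last."""
    problems = []
    expected = 1
    for position, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code" or _is_empty(cell):
            continue
        count = cell.get("execution_count")
        if count != expected:
            problems.append(
                f"cell {position}: execution_count={count}, expected {expected}"
            )
        expected += 1
    return problems


# Hypothetical notebook in which the last two code cells ran in swapped order.
sample_notebook = json.loads("""
{
  "cells": [
    {"cell_type": "markdown", "source": "# Analysis"},
    {"cell_type": "code", "source": "x = 1", "execution_count": 1},
    {"cell_type": "code", "source": "plot(x)", "execution_count": 3},
    {"cell_type": "code", "source": "x = 2", "execution_count": 2}
  ]
}
""")

for problem in execution_order_problems(sample_notebook):
    print(problem)
```

For a real file you would load it with `json.load` and could fail a CI job whenever the returned list is non-empty.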

This isn't to discount the improvement in UX that you get from notebooks compared to my preferred alternative (emacs with org-mode). Maybe I'm just bitter that the ipynb format exists at all. If notebooks were just a UX layered on top of emacs+org-mode, that would fix most of the core issues.

I like notebooks; they are a useful tool. But they are just a slight adjustment to the programming model and an alternative type of IDE, and they do not do much to help reproducibility. Versioning of data, software, and dependencies is much more important, as is verifying that the code actually runs on another machine and produces the correct results. Setting up CI for the project, with basic end-to-end tests, is the minimum level I set for my research (in applied machine learning).
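As a rough illustration of what data and dependency versioning can look like at its simplest, the sketch below records the Python version, the installed versions of a few packages, and a SHA-256 hash of each input file, then reports which of those changed between two runs. Everything here is an assumption made for the example: the helper names (`build_manifest`, `changed_inputs`), the package name, and the data file are hypothetical, and a real project would more likely rely on a lock file plus a CI job than on a hand-rolled manifest.

```python
import hashlib
import platform
import tempfile
from importlib import metadata
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large data files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(packages, data_files) -> dict:
    """Record the interpreter, package versions, and data hashes for one run."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "packages": versions,
        "data_sha256": {str(path): sha256_of(path) for path in data_files},
    }


def changed_inputs(recorded: dict, current: dict) -> list:
    """Return the top-level manifest sections that differ between two runs."""
    return sorted(key for key in recorded if recorded[key] != current.get(key))


# Demo with a throwaway data file; the package name is deliberately fictional.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "measurements.csv"
    data.write_text("sample,value\na,1\n")
    at_publication = build_manifest(["hypothetical-analysis-lib"], [data])

    data.write_text("sample,value\na,2\n")  # someone quietly edits the data
    today = build_manifest(["hypothetical-analysis-lib"], [data])

    print(changed_inputs(at_publication, today))
```

An end-to-end test in CI could commit the manifest next to the paper, rebuild it on a clean machine, and fail if `changed_inputs` returns anything or if the regenerated results differ from the published ones.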