by jpeloquin 2784 days ago
> Each figure should be an individually published entity which contains the entire computational pipeline.

I agree in principle. But, for the experimental sciences, we need better publication infrastructure to make this practically possible.

For example, consider a figure that compares, between several groups, the mechanical strain of tensile test specimens for a given load. Strain is measured from digital image correlation of video of the test. Some pain points:

1. There are a few hundred GB of test video underlying the figure. Where should the author put this so that it will remain publicly accessible for the useful lifetime of the paper? How long should it remain accessible, anyway? The scientific record is ostensibly permanent, but relying on authors to personally maintain cloud hosting accounts for data distribution will seldom provide more than a couple of years of data availability.

2. Open data hosts that aim for permanent archival of scientific data do exist (e.g., the Open Science Framework), but their infrastructure is a poor match for reproducible practices. I haven't found an open data host that both accepts uploads via git + git annex or git + git LFS and has permissive repository size limits. Often the provided file upload tool can't even handle folders, requiring all files to be uploaded individually. Publishing open data usually requires reorganizing it according to the data host's worldview or publishing a subset of the data, which breaks the existing computational analysis pipeline.

3. Proprietary software was used in the analysis pipeline. The particular version of the software that was used is no longer sold. It's unclear how someone without the software license would reproduce the analysis.
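For points 1 and 2, one partial mitigation is to publish a checksum manifest alongside the figure, so that wherever the raw files end up, a reader can at least confirm they are the same bytes the pipeline consumed. Here is a minimal sketch in Python, standard library only. The function names and the manifest layout (relative path mapped to SHA-256 digest) are hypothetical choices for illustration, not the format used by git annex, git LFS, or any data host.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so multi-GB videos never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir):
    """Map each file's path (relative to data_dir) to its SHA-256 digest."""
    root = Path(data_dir)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def save_manifest(manifest, manifest_path):
    """Write the manifest as JSON; this small file is easy to version and archive."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2, sort_keys=True)


def load_manifest(manifest_path):
    """Read a manifest previously written by save_manifest."""
    with open(manifest_path, encoding="utf-8") as f:
        return json.load(f)


def verify(data_dir, manifest):
    """Return the relative paths that are missing or whose content has changed."""
    root = Path(data_dir)
    problems = []
    for rel, expected in manifest.items():
        p = root / rel
        if not p.is_file() or sha256_of(p) != expected:
            problems.append(rel)
    return problems
```

The author would run `build_manifest` once on the raw data directory and archive the resulting JSON with the paper; a reader who obtains the data from any mirror runs `verify` before rerunning the analysis. This does nothing for the hosting problem itself, and it only detects a reorganized directory layout rather than tolerating it.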

Finally, there's the issue of computational literacy of scientists. In most cases, the "computational pipeline" is a grad student clicking through a GUI a couple hundred times, and occasionally copying the results into an MS Office document for publication. No version control. Generally, an interactive analysis session cannot be stored and reproduced later. How do we change this? Can we make version control (including of large binary files) user-friendly enough that non-programmers will use it? And make it easy to update Word / PowerPoint documents from the data analysis pipeline instead of relying on copy & paste?

If any of these pain points are in fact solved and my information is out of date, I would be thrilled to hear it.

1 comment

1 and 2: I like IPFS for this; check it out.

3: Analysis that uses proprietary software is marked appropriately as second class.

> computational literacy of scientists

Welp...