by TuringNYC 3383 days ago
I don't want to undermine visualizations, they are awesome, but one of the big problems I see with ML research is the lack of reproducibility. I know that Google, Facebook and some others already share associated source repos, but it should almost be mandatory when working with public benchmark datasets. Source + Docker images would be even better.

I worked in clinical research in a past life, and studies would be highly discounted if they couldn't be reproduced. A highly detailed methods section was key. Many ML papers I see tend to have incredibly formalized, LaTeX+Greek-obsessed methods sections that still fall far short of anything that would allow reproduction. Some ML papers, I swear, must have run their parameter searches 1000 times to overfit and magically achieve 99% AUC.

Worse, I actually have tons of spare GPU farm capacity I'd love to devote to reproducing research, tweaking it, trying it on adjacent datasets, etc. But the effort to reproduce is too high for most papers.

It is also disappointing to see various input datasets strewn about individuals' personal homepages, where the links sometimes end up broken. Sometimes the "original" dataset is in a pickled form after having already gone through multiple upstream transformations. I hope Distill can instill some good practices in the community.


I think that having a venue that can publish non-traditional academic artifacts is an important step for reproducibility, even if it isn't our focus.

It seems clear to me that the future will involve some way of linking reproducibility to papers. If we want to find that future, we need a way for people to experiment with what a publication is.

Jupyter notebooks are a big piece of solving ML reproducibility, it feels like.
I see this a lot, but I disagree, at least in their current form. They miss a variety of very key parts for reproducibility (which, to be fair, was not their original goal).

* Dependencies like libraries are not specified anywhere.

* Dependencies on local code are not bundled.

* Dependencies on local data are not bundled.

* Underlying requirements like LLVM (which needs to be specifically 3.9.X for llvmlite in Python, as I discovered recently) are not recorded.

* Perhaps most dangerously, you can run the code sections out of order, and deleted sections will leave their variables around which can interfere with the run. I've been caught out by this in my own notebooks.
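
To illustrate that last point, here is a minimal sketch that imitates a notebook kernel with a plain dict as the shared namespace. It is not how Jupyter is implemented; the cell contents and the names `cell_define`, `cell_use` and `run_cells` are made up for the example.

```python
# Each string stands in for one notebook cell; the shared dict plays the
# role of the kernel's namespace. All names here are hypothetical.
cell_define = "threshold = 0.5"
cell_use = "kept = [x for x in [0.2, 0.7, 0.9] if x > threshold]"


def run_cells(cells, namespace):
    """Execute cells in order against one namespace, like a live kernel."""
    for source in cells:
        exec(source, namespace)
    return namespace


# Live session: the defining cell ran once, then the author deleted it.
# Its variable is still in the namespace, so the remaining cell works.
live = run_cells([cell_define], {})
run_cells([cell_use], live)
print("live session:", live["kept"])

# Fresh kernel: only the cell that is still in the saved notebook is run.
try:
    run_cells([cell_use], {})
    fresh_ok = True
except NameError as err:
    fresh_ok = False
    print("fresh kernel:", err)
```

The saved notebook looks fine to its author because the stale variable is still alive in their session, but anyone who opens it and runs it top to bottom hits the `NameError`.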

I really like jupyter notebooks, but I think some of the design decisions (correct for some ways of working) actively work against reproducible reports.

There was a recent writeup here:

> we were able to successfully execute only one of the ~25 notebooks that we downloaded.

https://markwoodbridge.com/2017/03/05/jupyter-reproducible-s...

Right, "a part" was important. Looks like the authors of that writeup agree.

> Technologies such as Jupyter and Docker present great opportunities to make digital research more reproducible, and authors who adopt them should be applauded.

I somewhat disagree that it's a big part, or even that it really should be a part, of the solution. I'm really not sure that these notebooks are the right approach to making research reproducible. To me, the conclusion there doesn't seem supported by their findings.

I think they solve a different use case well, and forcing them into a workflow they weren't designed for may just result in both less useful notebooks and a poor experience.

Edit - To expand a little: Jupyter notebooks are a nice way to mix code and descriptions, and in essence they force people to release a certain amount of their code. But other than that, they actually provide fewer of the guarantees you want for reproducibility. And since reproducibility goals generally impose more restrictions on how you work, I can see more issues arising from trying to reconcile these different ways of working.

I don't see which of their features are useful for the goal of making things reproducible, or why people keep bringing them up as a solution.

The main steps would seem to be:

1. Make sure the results used are not generated on "my machine" but on a specified base run somewhere else. Just like we don't take the unit test results I run locally as gospel.

2. Unique and versioned identifiers for code, base system and data.

3. Archived code and data.

4. An agreed-upon format in the output data to say where it came from (referencing the identifiers for the code, the base system used and the input data).
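
As a rough sketch of steps 2 and 4, something like the following could produce a small provenance record to attach to any output. The field names, the `build_provenance` helper and the choice of SHA-256 are my own assumptions for illustration, not an agreed format, and a real setup would hash the archived code and data rather than inline byte strings.

```python
import hashlib
import json
import platform
from importlib import metadata


def sha256_bytes(payload: bytes) -> str:
    """Content hash used as a unique identifier for code or data."""
    return hashlib.sha256(payload).hexdigest()


def build_provenance(code: bytes, data: bytes, packages=()) -> dict:
    """Describe where a result came from (hypothetical record layout)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "code_sha256": sha256_bytes(code),
        "data_sha256": sha256_bytes(data),
        "base_system": {
            "python": platform.python_version(),
            "platform": platform.platform(),
            "packages": versions,
        },
    }


# Placeholder inputs; in practice these would be the archived files' bytes.
record = build_provenance(
    code=b"print('train')\n",
    data=b"x,y\n1,2\n",
    packages=("numpy",),
)
print(json.dumps(record, indent=2, sort_keys=True))
```

The point is only that the record is derived from the inputs themselves, so changing the code, the data or the base system changes the identifiers that travel with the output.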

Your output might be a rendered notebook, but the notebook itself is entirely orthogonal to the process, as what a notebook provides is:

* A nice interface for entering the code

* A nice output format

* A neat way of mixing nicely written documentation along with the code