|
|
|
|
|
by TuringNYC
3383 days ago
|
|
I don't want to undermine visualizations, they are awesome, but one of the big problems I see with ML research is the lack of re-produceability. I know that Google, Facebook and some others already share associated source repos, but it should almost be mandatory when working with public benchmark datasets. Source + Docker Images would be even better. I worked in clinical research in a past life and studies would be highly discounted if they couldn't be reproduced. A highly detailed methods section was key. Many ML papers I see tend to have incredibly formalized LaTeX+Greek obsessed methods section, but far short of anything to allow reproduction. Some ML papers, i swear must have run their parameter searches a 1000 times to overfit and magically achieve 99% AUC. Worse, I actually have tons of spare GPU farm capacity i'd love to devote to re-producing research, tweaking, trying it on adjacent datasets, etc. But the effort to re-produce is too high for most papers. It is also disappointing to see various input datasets strewn about individuals' personal homepages, and sometimes end up broken. Sometimes the "original" dataset is in a pickled form after having already gone through multiple upstream transformations. I hope Distill can instill some good best practices to the community. |
|
It seems clear to me that the future will involve some kind of linking reproducibility to papers. If we want to find that future, we need a way for people to experiment with what a publication is.