by jfrankle 2673 days ago
This is an illuminating (and notably rigorous) read for anyone interested in neural network sparsity and compression. But - equally importantly - it's a valuable read for anyone interested in the replicability of neural network research in general. The authors make clear the urgent need to evaluate research (and reevaluate received wisdom) on networks of the scale and complexity used in practice. I hope this paper will spark some important conversations in the community about our standards for assessing new ideas (mine included). As this paper makes exceedingly clear, plenty of techniques and behaviors for MNIST and CIFAR10 manifest differently (if at all) in industrial-scale settings.

My biggest question coming out of this work was as follows: which small-scale (or, at the very least, inexpensive) benchmarks share enough properties with these large-scale networks that we should expect results to scale with reasonable fidelity? ResNet-50 is still far too slow and expensive to use as a day-to-day research network in academia, let alone a Transformer. Personally, I've found that ResNet-18 on CIFAR-10 pretty reliably predicts the behavior of ResNet-50 on ImageNet, but that's anecdotal. For the academics who can't drop hundreds of thousands of dollars (or more) on each paper but still want to contribute to research progress, we should carefully assess (or design) benchmarks with this property in mind.

(With respect to the lottery ticket hypothesis, we have a complementary ICML submission about its behavior on large-scale networks coming shortly!)

1 comment

I think the goal should be to use the smallest dense network possible as the baseline. For MNIST, this might be a LeNet-style convnet with [3, 9, 50] instead of the standard [20, 50, 500] network (which is way overkill).
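
To give a rough sense of the size gap, here is a minimal sketch that counts parameters for both configurations. It assumes the three numbers mean [conv1 filters, conv2 filters, fully connected units], and it assumes the usual Caffe-style LeNet layout: 5x5 unpadded convolutions, each followed by 2x2 max pooling, on 28x28 single-channel MNIST input with a 10-way output layer. The function name `lenet_param_count` and its parameters are made up for illustration.

```python
def lenet_param_count(conv1, conv2, fc, in_size=28, in_ch=1, k=5, n_classes=10):
    """Count weights and biases in a LeNet-style convnet.

    Assumed layout (not stated in the comment above): two unpadded k x k
    convolutions, each followed by 2x2 pooling, then one hidden fully
    connected layer and an output layer.
    """
    # Spatial size after each conv (shrinks by k - 1) and 2x2 pool (halves).
    size = (in_size - k + 1) // 2
    size = (size - k + 1) // 2

    params = k * k * in_ch * conv1 + conv1       # conv1 weights + biases
    params += k * k * conv1 * conv2 + conv2      # conv2 weights + biases
    params += size * size * conv2 * fc + fc      # flatten -> hidden layer
    params += fc * n_classes + n_classes         # hidden -> output layer
    return params


standard = lenet_param_count(20, 50, 500)
small = lenet_param_count(3, 9, 50)
print(standard, small, round(standard / small, 1))
```

Under these assumptions the standard network has 431,080 parameters and the small one has 8,522, so the baseline being pruned is about 50 times smaller before any sparsification happens.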

I haven't explored this on CIFAR, but my guess is that using a more efficient architecture like MobileNetV2 would yield results that are more likely to transfer.

The general theme is that you should be using the smallest dense model you possibly can as a baseline.