There's a strong motivation for small-model work even if all you care about is big-model results: more efficient use of compute makes big models effectively bigger. Small-model architectural innovations are computational leverage. You can even see the convolution operation in this light; by sharing weights across positions, it's much more efficient than the 'giant dense matrix' approach.
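To put rough numbers on the convolution point, here's a back-of-the-envelope sketch. The shapes (a 32x32 RGB input, 16 output channels, a 3x3 kernel) are assumptions picked for illustration, and biases are ignored; the function names are made up for this example.

```python
def dense_params(h, w, c_in, c_out):
    """Weights in a fully connected layer mapping an h*w*c_in input
    to an h*w*c_out output (every output unit sees every input unit)."""
    n_in = h * w * c_in
    n_out = h * w * c_out
    return n_in * n_out


def conv_params(k, c_in, c_out):
    """Weights in a k x k convolution. The same kernel is reused at every
    spatial position, so the count does not depend on h or w."""
    return k * k * c_in * c_out


# Illustrative shapes only (assumed, not from any particular model).
h, w, c_in, c_out, k = 32, 32, 3, 16, 3

dense = dense_params(h, w, c_in, c_out)
conv = conv_params(k, c_in, c_out)

print(f"dense weights: {dense:,}")
print(f"conv weights:  {conv:,}")
print(f"ratio:         {dense // conv:,}x")
```

Same input and output shapes, several orders of magnitude fewer weights. The gap widens as the spatial size grows, since the dense count scales with (h*w)^2 while the conv count stays fixed.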

EfficientNet is an exemplar of this approach; the authors built much better small models and, because the architecture was better overall, wound up with much higher-quality big models as a result: https://arxiv.org/pdf/1905.11946.pdf

We're currently seeing some great results with more efficient attention layers, which will make the current 'big' models much more efficient and unlock a next generation of higher-quality big models.