by sdenton4
2024 days ago
These things matter a lot in practice. Imagine a giant million-dimensional loss surface, where each point is a set of weights for the model. The gradient pushes us around on this surface, trying to find a 'minimum.' The current understanding (for a while now, actually) is that you never really hit minima so much as giant, mostly flat regions where further improvement might take a million years. The loss surfaces for models with skip connections seem to be much, much nicer. https://papers.nips.cc/paper/2018/file/a41b3bb3e6b050b6c9067...

In effect, there's a big gap between an existence proof and actually workable models, and the tricks of the trade do quite a lot to close the gap. (And there are almost certainly more tricks that we're still not aware of! I'm still amazed at how late in the game batch normalization was discovered.)

OTOH, so long as you're using the basic tricks of the trade, IME architecture doesn't really matter much. Our recent Kaggle competition for birdsong identification was a great example of this: pretty much everyone reported that the difference between five or so 'best practices' feature extraction architectures (various permutations of ResNet/EfficientNet) was negligible.
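For anyone who hasn't met skip connections: the idea is just to compute `x + f(x)` instead of `f(x)` at each layer. Here's a toy sketch of one commonly cited reason that helps optimization. It's an illustration only, not taken from the linked paper, and it shows gradient flow through a deep stack rather than the loss surface itself. The scalar `tanh` layer, the shared weight `w`, and both function names are made up for the example.

```python
import math

def plain_grad(x, w, depth):
    """d(output)/d(input) for `depth` stacked toy layers x -> tanh(w * x)."""
    grad = 1.0
    for _ in range(depth):
        pre = w * x
        # Chain rule: derivative of tanh(w * x) with respect to x.
        grad *= w * (1.0 - math.tanh(pre) ** 2)
        x = math.tanh(pre)
    return grad

def residual_grad(x, w, depth):
    """Same stack, but each layer is x -> x + tanh(w * x) (a skip connection)."""
    grad = 1.0
    for _ in range(depth):
        pre = w * x
        # The identity path contributes the "1.0 +" term to every factor.
        grad *= 1.0 + w * (1.0 - math.tanh(pre) ** 2)
        x = x + math.tanh(pre)
    return grad

if __name__ == "__main__":
    # Hypothetical settings: 50 layers, weight 0.5, input 1.0.
    print("plain   :", plain_grad(1.0, 0.5, 50))
    print("residual:", residual_grad(1.0, 0.5, 50))
```

With a positive `w` below 1, every factor in the plain stack is less than 1, so the gradient shrinks geometrically with depth. In the residual stack every factor is at least 1, so the signal from the input never vanishes. Real networks are obviously messier than a scalar chain, so treat this as intuition, not a proof.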
Can we do the same with neural networks? Is there a formalization of why 'skip connections' (which I know nothing about) are better, why transformers are more efficient than recurrence, etc.?
Is it useful to talk about their complexity or universal properties or size? (I realize this is muddled up a bit by the fact that hardware architecture can sometimes trump efficiency.)