by godelski
661 days ago
I think you're confusing capacity with the training dynamics. Capacity is autological: it's the amount of information the model can express. Training dynamics are the way the model learns, the optimization process, etc., so this is where things like regularization come into play. There's also architecture, which affects the training dynamics as well as model capacity. None of this guarantees that you get the most information-dense representation. Fwiw, the authors did also try distillation.
> With enough computation, your neural net weights would converge to some very compressed latent representation of the source code of DOOM. Maybe smaller even than the source code itself? Someone in the field could probably correct me on that.
And they're not wrong! An ideally trained network could, in principle, learn the data-generating program, if that program is within its class of representable functions. I might have a NN that naively looks like it takes up GBs of space, but it might actually be parameterizing a much simpler function (hence our ability to prune/compress the weights without performance loss - most of the capacity wasn't being used for any interesting computation).
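To make the pruning point concrete, here's a toy sketch. It is not a neural net: I'm assuming a plain linear model fit with least squares as a stand-in, and the sizes, the `true_support` indices, and the noise level are all made up for illustration. The dense fit has 200 free weights, but the data only depends on 5 of them, so magnitude pruning down to 5 weights should cost almost nothing on held-out data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 200 parameters, but the data-generating
# function only uses 5 of them.
n_train, n_test, n_features = 2000, 500, 200
true_support = np.array([3, 17, 42, 99, 150])

w_true = np.zeros(n_features)
w_true[true_support] = np.array([2.0, -1.5, 1.0, 3.0, -2.5])


def make_data(n):
    """Draw Gaussian inputs and slightly noisy targets from the sparse model."""
    X = rng.normal(size=(n, n_features))
    y = X @ w_true + 0.01 * rng.normal(size=n)
    return X, y


X_train, y_train = make_data(n_train)
X_test, y_test = make_data(n_test)

# Dense fit: every one of the 200 weights is free to move.
w_dense, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Magnitude pruning: keep the k largest |w| and zero out the rest.
k = 5
kept = np.sort(np.argsort(np.abs(w_dense))[-k:])
w_pruned = np.zeros_like(w_dense)
w_pruned[kept] = w_dense[kept]


def mse(w):
    """Mean squared error on the held-out set."""
    return float(np.mean((X_test @ w - y_test) ** 2))


dense_mse = mse(w_dense)
pruned_mse = mse(w_pruned)

print("kept weights:", kept.tolist())
print(f"dense  test MSE ({n_features} weights): {dense_mse:.2e}")
print(f"pruned test MSE ({k} weights): {pruned_mse:.2e}")
```

Real networks are nonlinear and pruning them usually takes more care (and often some fine-tuning), so treat this only as an illustration of "big parameter count, small function".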
You're right that there's no guarantee that the model finds the most "dense" representation. The goal of regularization is to encourage that, though!
All over the place in ML there are bounds like:
test loss <= train loss + model complexity
Hence reducing model complexity tightens the bound on test loss. This is a kind of Occam's Razor: among models that fit the training data equally well, the simpler one comes with the better guarantee. So the OP is on the right track: we definitely want networks to learn the "underlying" process that explains the data, which in this case would be a latent representation of the source code. (Well, except that doesn't quite make sense, since you'd need the whole rest of the compute stack that code runs on. The neural net has no external resources or embodied complexity to call on, unlike the source code, which gets to rely on drivers, hardware, operating systems, etc.)
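For one concrete instance of the "test loss <= train loss + model complexity" shape, here's the standard Rademacher-complexity bound, written out as a sketch. The notation is my own assumption, not anything from the paper: a loss $\ell$ bounded in $[0, 1]$, an i.i.d. training sample of size $n$, a hypothesis class $\mathcal{F}$, and $\mathfrak{R}_n$ for Rademacher complexity. With probability at least $1 - \delta$ over the sample:

```latex
\[
  \underbrace{R(f)}_{\text{test loss}}
  \;\le\;
  \underbrace{\widehat{R}_n(f)}_{\text{train loss}}
  \;+\;
  \underbrace{2\,\mathfrak{R}_n(\ell \circ \mathcal{F})}_{\text{model complexity}}
  \;+\;
  \sqrt{\frac{\log(1/\delta)}{2n}}
  \qquad \text{for all } f \in \mathcal{F}.
\]
```

Note that the complexity term belongs to the class $\mathcal{F}$, not to the particular $f$ that training finds, and for large networks bounds of this kind are often too loose to be informative. It's the shape of the argument that matters here.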