by _hark
653 days ago
The raw capacity of the network doesn't tell you how complex the weights actually are. The capacity is only an upper bound on the complexity. You can see this from the fact that networks can often be pruned quite a bit without any loss in performance, i.e. the effective dimension of the manifold the weights live on can be much, much smaller than the total capacity allows for. In fact, good regularization is exactly that which encourages the model itself to be compressible.
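To make the pruning point concrete, here's a toy sketch (not a real network, and not anything from the paper): a linear "model" with 1000 weights where, by assumption, only 10 carry signal and the rest are near-zero noise. The names `magnitude_prune` and `predict` and all the numbers are made up for illustration.

```python
import random


def magnitude_prune(weights, keep):
    """Zero out everything except the `keep` largest-magnitude weights."""
    if keep <= 0:
        return [0.0] * len(weights)
    keep = min(keep, len(weights))
    # Magnitude of the keep-th largest weight acts as the cutoff.
    threshold = sorted((abs(w) for w in weights), reverse=True)[keep - 1]
    pruned, kept = [], 0
    for w in weights:
        if abs(w) >= threshold and kept < keep:
            pruned.append(w)
            kept += 1
        else:
            pruned.append(0.0)
    return pruned


def predict(weights, x):
    """Plain dot product, standing in for a forward pass."""
    return sum(w * xi for w, xi in zip(weights, x))


rng = random.Random(0)
n_params = 1000  # raw capacity: number of weights available

# Assumed toy "trained" weights: mostly tiny noise...
weights = [rng.uniform(-1e-4, 1e-4) for _ in range(n_params)]
# ...plus 10 weights that actually matter.
for i in range(10):
    weights[i * 100] = rng.choice([-1.0, 1.0]) * rng.uniform(0.5, 2.0)

pruned = magnitude_prune(weights, keep=10)
nonzero = sum(1 for w in pruned if w != 0.0)

# Compare outputs before and after pruning on random inputs in [-1, 1].
inputs = [[rng.uniform(-1.0, 1.0) for _ in range(n_params)] for _ in range(50)]
max_err = max(abs(predict(weights, x) - predict(pruned, x)) for x in inputs)

# By construction the error is at most 990 * 1e-4 = 0.099.
print(f"kept {nonzero}/{n_params} weights, max output change {max_err:.4f}")
```

Here 99% of the weights go away and the outputs barely move, so the parameter count was a loose upper bound on what the model was actually using. Real networks are obviously messier than this, but it's the same idea.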
Capacity is autological: it's the amount of information the model can express.
Training dynamics are the way the model learns, the optimization process, etc. So this is where things like regularization come into play.
There's also architecture, which affects the training dynamics as well as model capacity. None of this guarantees that you get the most information-dense representation.
Fwiw, the authors did also try distillation.