by Kalanos
1436 days ago
This is why model simplicity is so important. When an algorithm has fewer parameters, it's forced to use those weights to find the most broadly applicable patterns in the training data, as opposed to noise.

"Why might SGD prefer basins that are flatter?"
It's because gradient-based optimizers follow the derivative. When the bottom of the valley is flat, the gradient there is close to zero, so the updates are small and they don't have enough momentum to get out. I have observed the lottery ticket hypothesis.
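
Here is a toy sketch of the flat-versus-sharp point. It assumes a one-dimensional quadratic loss, 0.5 * curvature * x**2, and plain gradient descent with a fixed step size, no noise and no momentum term. The function name `gradient_descent`, the curvature values, and the learning rate are all made up for illustration. With the same step size, the iterate in the flat basin barely moves and stays inside, while the iterate in the sharp basin overshoots and gets thrown out.

```python
def gradient_descent(curvature, x0=1.0, lr=0.25, steps=20):
    """Plain gradient descent on the toy loss 0.5 * curvature * x**2."""
    x = x0
    for _ in range(steps):
        grad = curvature * x  # derivative of 0.5 * curvature * x**2
        x = x - lr * grad     # fixed-step update, no momentum, no noise
    return x


# Hypothetical curvatures: a flat basin and a sharp basin, same step size.
flat_final = gradient_descent(curvature=0.1)
sharp_final = gradient_descent(curvature=10.0)

print("flat basin, distance from minimum:", abs(flat_final))
print("sharp basin, distance from minimum:", abs(sharp_final))
```

Each step multiplies x by (1 - lr * curvature), so in the flat basin the small gradient shrinks the distance slowly, and in the sharp basin the large gradient makes every step overshoot by more than the last. Real SGD adds minibatch noise on top of this, which this sketch leaves out.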