by bosco_mcnasty
847 days ago
Your point is valid, but the paper explains it clearly. They are NOT dimensionally reduced hyperparameters. The hyperparameters are learning rates, that's it. X axis: learning rate for the input layer (the network has 1 hidden layer). Y axis: learning rate for the output layer. So what this is saying is that for certain ill-chosen learning rates, model convergence is, for lack of a better word, chaotic and unstable.
Training consists of 500 (sometimes 1000) iterations of full batch steepest gradient descent. Training is performed for a 2d grid of η0 and η1 hyperparameter values, with all other hyperparameters held fixed (including network initialization and training data).
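
If it helps, here is a rough sketch of what that sweep looks like in code. This is not the paper's code. The quote above does not give the layer sizes, dataset, loss, or initialization scale, so everything in `sweep` (16 points, 4 inputs, 8 tanh hidden units, squared error, seed 0) is an assumption picked to keep it small. The function names `train` and `sweep` are made up too. The parts taken from the quote are: full batch gradient descent, a fixed number of steps, one learning rate per layer, and the same init and data for every grid cell.

```python
import numpy as np

def train(eta0, eta1, W0_init, W1_init, X, Y, steps=500):
    """Full-batch gradient descent on a 1-hidden-layer tanh network.

    eta0 is the learning rate for the input-layer weights W0,
    eta1 is the learning rate for the output-layer weights W1.
    Returns the final loss, or inf if training blew up.
    """
    W0, W1 = W0_init.copy(), W1_init.copy()
    n = X.shape[0]
    # Overflow is expected for bad learning rates, so silence the warnings.
    with np.errstate(over="ignore", invalid="ignore"):
        for _ in range(steps):
            H = np.tanh(X @ W0)              # hidden activations
            err = H @ W1 - Y                 # residual of the squared-error loss
            gW1 = H.T @ err / n              # gradient w.r.t. output weights
            gW0 = X.T @ ((err @ W1.T) * (1.0 - H**2)) / n  # backprop through tanh
            W0 -= eta0 * gW0
            W1 -= eta1 * gW1
            if not (np.isfinite(W0).all() and np.isfinite(W1).all()):
                return np.inf
        loss = 0.5 * np.mean((np.tanh(X @ W0) @ W1 - Y) ** 2)
    return loss if np.isfinite(loss) else np.inf

def sweep(etas0, etas1, seed=0, n=16, d=4, width=8, steps=500):
    """Train once per (eta0, eta1) pair with identical init and data.

    All sizes here are assumptions for illustration only.
    Returns a grid with rows indexed by eta1 (Y axis) and
    columns indexed by eta0 (X axis).
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    Y = rng.normal(size=(n, 1))
    W0 = rng.normal(size=(d, width)) / np.sqrt(d)
    W1 = rng.normal(size=(width, 1)) / np.sqrt(width)
    grid = np.empty((len(etas1), len(etas0)))
    for i, e1 in enumerate(etas1):
        for j, e0 in enumerate(etas0):
            grid[i, j] = train(e0, e1, W0, W1, X, Y, steps)
    return grid

if __name__ == "__main__":
    etas = np.logspace(-2, 2, 5)
    print(np.isfinite(sweep(etas, etas)))  # True where training stayed finite
```

Each cell of the grid is one complete training run. The only thing that changes between cells is the pair of learning rates, which is why the picture is a plain 2D map and not a projection of some higher-dimensional hyperparameter space.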