by knexer
3125 days ago
Two things stick out to me after a first read. First, this actually learns a schedule for each hyperparameter, not just a good set of fixed values, automatically discovering learning rate annealing and related techniques. This seems incredibly powerful. It is also learning hyperparameter schedules specific to a single training run, which seems interesting but not obviously helpful, especially since many of the learned schedules fairly closely match the baseline hand-tuned ones. Second, it seems like they're optimizing against their validation metric directly; isn't that basically 'cheating' (i.e., doesn't it defeat much of the point of having a separate validation metric in the first place)? It also seems completely orthogonal to their technique: could they not have optimized for the same loss function as the network itself? Is this an improvement over the state of the art, or is it just overfitting to the validation metric?
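To make the "schedule, not fixed values" point concrete, here is a toy sketch of how a population of training runs could end up with a per-run learning rate schedule while selecting on a validation-style score. Everything in it is an assumption for illustration, not the paper's actual procedure: the quadratic loss, the `evaluate` stand-in for a validation metric, the copy-and-perturb rule, the `LR_MAX` clamp, and all names are hypothetical.

```python
import random

LR_MAX = 0.4  # hypothetical upper bound on the learning rate search range


def train_step(theta, lr):
    # One gradient step on the toy loss f(theta) = theta**2 (gradient is 2*theta).
    return theta - lr * 2.0 * theta


def evaluate(theta):
    # Stand-in for a validation metric; higher is better.
    return -theta ** 2


def population_search(n_workers=8, n_rounds=10, steps_per_round=5, seed=0):
    rng = random.Random(seed)
    workers = [
        {"theta": 5.0, "lr": 10 ** rng.uniform(-3, -0.5), "history": []}
        for _ in range(n_workers)
    ]
    k = max(1, n_workers // 4)
    for _ in range(n_rounds):
        # Train every worker for a few steps with its current learning rate.
        for w in workers:
            for _ in range(steps_per_round):
                w["theta"] = train_step(w["theta"], w["lr"])
            w["history"].append(w["lr"])
        # Rank workers by the validation-style score, best first.
        workers.sort(key=lambda w: evaluate(w["theta"]), reverse=True)
        # Bottom quarter copies the top quarter (weights and lineage),
        # then perturbs the copied learning rate.
        for loser, winner in zip(workers[-k:], workers[:k]):
            loser["theta"] = winner["theta"]
            loser["lr"] = min(winner["lr"] * rng.choice([0.8, 1.2]), LR_MAX)
            loser["history"] = list(winner["history"])
    return max(workers, key=lambda w: evaluate(w["theta"]))


best = population_search()
print(best["history"])  # learning rate used in each round along the winning lineage
```

The `history` list is the learned schedule: it records the learning rate in force during each round along the lineage that ended up best. Note that the ranking uses `evaluate` directly, which is exactly the "selecting on the validation metric" concern raised above.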
As for regular supervised learning: it's no worse than, say, early stopping based on validation scores. In principle it should be a problem, but in practice NNs generalize anyway, and since this paper implies that Google Brain & DM are now doing this kind of hyperparameter optimization routinely for everything, I figure that they would have noticed any overfitting problems by now (either when the methods failed to outperform on one of Google's huge private internal databases, or when they rolled out the translator).
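For comparison, a minimal sketch of early stopping on validation scores, plus the kind of check that would reveal overfitting to the validation set: comparing against a held-out test score that was never used for selection. The per-epoch scores, the `patience` value, and the function name are made up for illustration.

```python
def early_stop_epoch(val_scores, patience=2):
    # Return the epoch with the best validation score, stopping once
    # `patience` epochs have passed without any improvement.
    best_epoch, best_score = 0, val_scores[0]
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            break
    return best_epoch


# Hypothetical per-epoch accuracies, not taken from any real run.
val_scores = [0.60, 0.70, 0.74, 0.73, 0.72, 0.80]
test_scores = [0.59, 0.69, 0.72, 0.72, 0.71, 0.70]

chosen = early_stop_epoch(val_scores)
gap = val_scores[chosen] - test_scores[chosen]
print(chosen, round(gap, 2))
```

A small gap between the validation score used for selection and the untouched test score is the evidence that selection on validation did not overfit much; a large gap would be the warning sign.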