by mjw 4124 days ago
Yeah I was thinking about this after I posted. Not entirely convinced though -- I want the hyperparameters I learn to generalise to unseen data, just like plain old parameters. If there are two methods for learning them then I'm going to pick the one which performs best on unseen data and I'd like a metric which helps me make that choice.

Sure, you can evaluate them purely as optimisation algorithms, but does it follow that the better optimisation algorithm is necessarily better at picking hyperparameters that generalise to unseen data?

One way that hyperparameter optimisation can overfit, which people don't always think about, is by repeatedly evaluating high-variance metrics and picking the best of N tries. This has burned me when optimising settings for stochastic optimisation algorithms, for example. An algorithm that was very aggressive in doing this might reach a better maximum on the validation set but wouldn't do any better on held-out data.
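
To make the best-of-N effect concrete, here's a toy simulation (a sketch, not anything from a real tuning run). It assumes every candidate setting has the same true score and that validation and test scores are that true score plus independent Gaussian noise. The names `best_of_n` and `average_gap` and the numbers 0.80 and 0.02 are made up for illustration.

```python
import random
import statistics


def best_of_n(n_tries, rng, true_score=0.80, noise_sd=0.02):
    """Pick the try with the best noisy validation score.

    Assumption: all tries are equally good (same true_score), so any
    difference between them on the validation set is pure noise.
    Returns (validation score of the winner, test score of the winner).
    """
    val = [rng.gauss(true_score, noise_sd) for _ in range(n_tries)]
    test = [rng.gauss(true_score, noise_sd) for _ in range(n_tries)]
    winner = max(range(n_tries), key=lambda i: val[i])
    return val[winner], test[winner]


def average_gap(n_tries, repeats=2000, seed=0):
    """Average the winner's validation and test scores over many repeats."""
    rng = random.Random(seed)
    results = [best_of_n(n_tries, rng) for _ in range(repeats)]
    mean_val = statistics.fmean(v for v, _ in results)
    mean_test = statistics.fmean(t for _, t in results)
    return mean_val, mean_test


if __name__ == "__main__":
    for n in (1, 5, 50):
        mean_val, mean_test = average_gap(n)
        print(f"N={n:3d}  validation={mean_val:.3f}  test={mean_test:.3f}")
```

Under these assumptions the winner's validation score climbs as N grows while its test score stays near the true score, which is exactly the gap a held-out set would expose.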

There are things you can do to compensate for that of course (variance estimates for metrics are a good idea!), but evaluating on a test set usually doesn't hurt and seems like the safest option.
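
For the variance-estimate idea, a minimal sketch might look like the following. It assumes you can re-run each setting a few times (say, with different random seeds) and collect one validation score per run. The helper names `summarise` and `clearly_better` and the two-standard-error threshold are my own illustrative choices, not a standard recipe.

```python
import math
import statistics


def summarise(scores):
    """Return (mean, standard error of the mean) for repeated runs."""
    mean = statistics.fmean(scores)
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, sem


def clearly_better(scores_a, scores_b, z=2.0):
    """True only if A beats B by more than z combined standard errors.

    This is a rough rule of thumb (assumed here), not a formal test.
    """
    mean_a, sem_a = summarise(scores_a)
    mean_b, sem_b = summarise(scores_b)
    return (mean_a - mean_b) > z * math.hypot(sem_a, sem_b)


if __name__ == "__main__":
    a = [0.81, 0.79, 0.80, 0.82]  # hypothetical scores for setting A
    b = [0.80, 0.81, 0.79, 0.80]  # hypothetical scores for setting B
    print(clearly_better(a, b))
```

The point is just to avoid crowning a setting whose lead is within the noise; it doesn't replace checking the final choice on held-out data.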

1 comment

>If there are two methods for learning them then I'm going to pick the one which performs best on unseen data and I'd like a metric which helps me make that choice.

But both methods will converge to the exact same set of hyperparameters: the ones that are optimal for the validation set. The only difference is that some methods are faster.