by mjw 4128 days ago
Nice post, couple of bits of feedback:

When you talk about "fit" it sounds like you mean fit to the training data, which would obviously be a bad thing to optimise hyperparameters for. From the github repo it sounds like you are using a held-out validation set, but maybe worth being clear about this (e.g. call it something like "predictive performance on validation set").

When you've optimised over hyper-parameters using a validation set, you need to hold out a further test set and report results of your optimised hyperparameter settings on that test set, rather than just report the best achieved metric on the validation set. Is that what you did here? Maybe worth a mention.
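To make the three-way split concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration (the toy 1-D ridge model, the 60/20/20 split, the grid of `lam` values, and the helper names); none of it is taken from the post or from SigOpt.

```python
import random

def fit_ridge_1d(points, lam):
    """Closed-form 1-D ridge regression without intercept."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / (sxx + lam)

def mse(points, w):
    """Mean squared error of the model y = w * x on the given points."""
    return sum((y - w * x) ** 2 for x, y in points) / len(points)

def split_three_ways(data, seed=0):
    """Shuffle and split into 60% train, 20% validation, 20% test."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n = len(shuffled)
    a, b = n * 6 // 10, n * 8 // 10
    return shuffled[:a], shuffled[a:b], shuffled[b:]

def tune_and_report(data, lambdas, seed=0):
    train, val, test = split_three_ways(data, seed)
    # The hyperparameter is chosen using the validation set only.
    best_lam = min(lambdas, key=lambda lam: mse(val, fit_ridge_1d(train, lam)))
    w = fit_ridge_1d(train, best_lam)
    return {
        "best_lam": best_lam,
        "val_mse": mse(val, w),    # optimistic: this is what was optimised
        "test_mse": mse(test, w),  # the number to report
    }

# Hypothetical toy data: y = 2x + Gaussian noise.
data_rng = random.Random(42)
data = []
for _ in range(200):
    x = data_rng.uniform(-1.0, 1.0)
    data.append((x, 2.0 * x + data_rng.gauss(0.0, 1.0)))

LAMBDAS = [0.0, 0.1, 1.0, 10.0, 100.0]
result = tune_and_report(data, LAMBDAS)
print(result)
```

The test set is touched exactly once, after `best_lam` is fixed, so `test_mse` is not biased by the search over `LAMBDAS`.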

A question about sigopt: how do you compare to open-source tools like hyperopt, spearmint and so on? Do you have proprietary algorithms? Are there classes of problems which you do better or worse on? Or is it more about the convenience?


Thanks for the feedback! Our goal was to make the post as simple and accessible as possible, and in doing so we may have cut out too much. We'll update it to be clearer about this.

SigOpt [0] vs OSS: We are similar to hyperopt, spearmint, and MOE [1]. We developed MOE at Yelp, and it also uses Gaussian processes to do Bayesian global optimization. SigOpt extends and expands upon our work on MOE while wrapping everything in a simple API [2] and web interface. One thing we learned while promoting MOE was that many people have this problem, but few have the time or expertise to get these expert-level open source tools running properly, so we built SigOpt to bring these powerful tools to anyone via a simple API.

[0]: https://sigopt.com

[1]: https://github.com/Yelp/MOE

[2]: https://sigopt.com/docs

>When you've optimised over hyper-parameters using a validation set, you need to hold out a further test set and report results of your optimised hyperparameter settings on that test set, rather than just report the best achieved metric on the validation set.

It is possible to overfit hyperparameters. However, that's beyond the scope of these methods, whose only goal is to find the best settings for the validation set. So comparing their validation scores is fair, and the test scores could potentially be misleading.

Yeah I was thinking about this after I posted. Not entirely convinced though -- I want the hyperparameters I learn to generalise to unseen data, just like plain old parameters. If there are two methods for learning them then I'm going to pick the one which performs best on unseen data and I'd like a metric which helps me make that choice.

Sure, you can evaluate them purely as optimisation algorithms, but does it follow that the better optimisation algorithm is necessarily better at picking hyperparameters that generalise to unseen data?

One way that hyperparameter optimisation can overfit, which people don't always think about, is by repeatedly evaluating high-variance metrics and picking the best of N tries. This has burned me when optimising settings for stochastic optimisation algorithms, for example. An algorithm that was very aggressive in doing this might reach a better maximum on the validation set but wouldn't do any better on held-out data.

There are things you can do to compensate for that of course (variance estimates for metrics are a good idea!), but evaluating on test set data usually doesn't hurt and seems like the safest option.
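The best-of-N effect is easy to see in a small simulation. This is only a sketch under stated assumptions: every configuration has the same true score, the validation and test metrics are that score plus independent Gaussian noise, and the names and default values (`true_score`, `noise_sd`, `repeats`) are hypothetical.

```python
import random

def best_of_n(n_tries, rng, true_score=0.80, noise_sd=0.02):
    """Pick the best of n noisy validation scores, then score it on fresh data.

    All configurations are assumed equally good, so any gain on the
    validation set comes purely from selecting on noise.
    """
    val_scores = [true_score + rng.gauss(0.0, noise_sd) for _ in range(n_tries)]
    best_val = max(val_scores)
    # The selected configuration is re-evaluated on held-out data,
    # where its noise is independent of the selection.
    test_score = true_score + rng.gauss(0.0, noise_sd)
    return best_val, test_score

def average_gap(n_tries, repeats=2000, seed=0):
    """Mean of (best validation score - test score) over many repeats."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(repeats):
        best_val, test_score = best_of_n(n_tries, rng)
        total += best_val - test_score
    return total / repeats

for n in (1, 5, 50):
    # The gap is close to zero for n = 1 and grows as n increases.
    print(n, round(average_gap(n), 4))
```

Under these assumptions a more aggressive search reports a higher validation maximum without being any better on held-out data, which is the gap a test set (or a variance estimate on the metric) exposes.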

>If there are two methods for learning them then I'm going to pick the one which performs best on unseen data and I'd like a metric which helps me make that choice.

But both methods will converge to the exact same set of hyperparameters: the ones that are optimal for the validation set. The only difference is that some methods are faster.