Hacker News
by Djrhfbfnsks 1682 days ago
For most of the problems for which I've tried to use Bayesian Optimization, I've had poor results because of unknown and heterogeneous noise in the underlying process that I'm trying to optimize.

I believe that modeling the noise directly using a 2nd Gaussian Process [1] could help, but I haven't gotten reliable results. I was hoping this topic would be addressed in the book, but don't see it.

[1] https://rdrr.io/cran/hetGP/

2 comments

I spent a long time trying to implement Noisy Bayesian Optimization [1], using both standard libraries and my own understanding, but ultimately I never got it to work very well.

It's a real pity, since a smart optimizer for very noisy functions would be really useful. I was trying to use it for chess engine tuning, since I know DeepMind used it for tuning AlphaZero. I really wonder how they got it to work well.

[1] https://github.com/thomasahle/noisy-bayesian-optimization

Were you trying to do optimization using a binary outcome? I am not familiar with many packages that do this out of the box. Your implementation looks like a good start for the binary case, but you will get better results by computing expected improvement on the probability of winning (i.e., Phi(f(x)) rather than f(x), where Phi() is the standard normal CDF). For more on why, see http://proceedings.mlr.press/v28/tesch13.pdf and https://arxiv.org/pdf/2110.09361.pdf. AFAIK this should be as simple as defining an MCAcquisitionObjective (https://botorch.org/docs/objectives) that passes samples through the CDF of a standard torch.distributions.Normal. You can then just pass that objective into qNoisyExpectedImprovement. Feel free to open a botorch GH issue if you try this out or need help!

I would also recommend using many more raw_samples and num_restarts in optimize_acqf(). 512 and 20, respectively, are good defaults. The current values are much too small to optimize the acquisition function effectively.
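To make the Phi(f(x)) suggestion concrete, here is a minimal standard-library sketch of the quantity being described. It is not BoTorch code. It assumes the latent value f at a candidate point has a Gaussian posterior with mean mu and standard deviation sigma, and that the incumbent win probability best_win_prob is a fixed number (qNoisyExpectedImprovement treats the incumbent differently). The function name and all parameter names are made up for illustration.

```python
import random
from statistics import NormalDist


def expected_improvement_on_win_prob(mu, sigma, best_win_prob,
                                     n_samples=4096, seed=0):
    """Monte Carlo estimate of E[max(Phi(f) - best_win_prob, 0)], f ~ N(mu, sigma^2).

    Assumption: the improvement is measured on the win probability Phi(f),
    not on the latent value f itself.
    """
    rng = random.Random(seed)      # fixed seed so repeated calls agree
    phi = NormalDist().cdf         # standard normal CDF
    total = 0.0
    for _ in range(n_samples):
        f = rng.gauss(mu, sigma)   # one posterior sample of the latent value
        total += max(phi(f) - best_win_prob, 0.0)
    return total / n_samples


# Compare two hypothetical candidates against an incumbent win rate of 50%.
ei_promising = expected_improvement_on_win_prob(mu=1.0, sigma=0.5, best_win_prob=0.5)
ei_poor = expected_improvement_on_win_prob(mu=-1.0, sigma=0.5, best_win_prob=0.5)
```

Because Phi saturates near 0 and 1, a candidate whose latent value is already far above the incumbent gains little extra credit for being even larger, which is one way the two criteria can rank points differently.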

Yes, I'm using a binary outcome, since that's what I get from playing a game. To get probabilities I'd have to play a lot of games with the same settings/features/point and take the mean, but it seems that defeats the point of Bayesian optimization finding the best point to evaluate for each iteration.

The SPSA method seems to work quite well with binary outcomes. This is what I was trying to beat. Unfortunately I was never able to converge faster than SPSA (or even come close), even after increasing the number of samples. There is a pretty long thread of us trying to make it work here: https://www.talkchess.com/forum3/viewtopic.php?f=7&t=71650&h...

I got some feedback from the botorch team back then: https://github.com/pytorch/botorch/issues/347#:~:text=thomas...

Replying as the author -- I do spend some time discussing heteroskedastic noise (beginning in §2.2 and intermittently throughout the following chapters), although you're right that I don't discuss this particular modeling approach. Personally I think that inferring heteroskedastic noise from data alone during Bayesian optimization is likely to be very difficult, as you'll need a lot of data and/or a very low-dimensional problem in order to identify the variable noise scale. (Note that the example in the hetGP writeup is only in one dimension.)

However, when the noise scale is either variable (but known) or can be modeled with a relatively simple (e.g., parametric) model, there may be some benefit to the added model complexity. Here you could include the parameters of the noise model among the model hyperparameters and proceed following the discussion in chapter 4. In doing so, I would be careful to ensure that the data actually support the heteroskedastic noise hypothesis.

Another approach that might be useful in some contexts is a heavy-tailed noise model such as Student-t errors (§§ 2.8, 11.9, 11.11).

Thanks for your suggestions. For my use case (tuning parameters of a financial market simulation), I'm essentially able to get good noise estimates for free by re-sampling a set of parameters multiple times.

So for example, rather than simulate an entire month in one shot, I'll simulate a day 30 times and therefore have a decent estimate of the noise for that result and be able to clearly distinguish the noise from the covariance of the Gaussian process.
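A minimal sketch of that replicate-based estimate, assuming the repeated runs at one parameter setting are independent and identically distributed. The function name and the sample numbers are hypothetical; the second return value is the noise variance of the averaged result, which is what a GP would treat as the observation noise for that point.

```python
import statistics


def summarize_replicates(results):
    """Return (mean, variance of that mean) for repeated runs at one setting.

    Assumes the runs are i.i.d., so the variance of the average is the
    sample variance divided by the number of runs.
    """
    n = len(results)
    mean = statistics.fmean(results)
    sample_var = statistics.variance(results)  # unbiased, divides by n - 1
    return mean, sample_var / n


# Hypothetical per-day results for a single parameter setting.
daily_results = [1.0, 2.0, 3.0]
y, noise_var = summarize_replicates(daily_results)
```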

The noise in these simulations can vary dramatically in parameter space (easily 10-100x), so it seems like it would be important to model.

That's a fortunate scenario! If you have good noise estimates available then you can sidestep the need to infer the noise scale and instead simply proceed with "typical" heteroskedastic inference. When the observation noise variances are known, you only need to modify the typical GP inference equations to replace the σ²I term that appears in the homoskedastic case (where σ² is the constant noise scale) with a diagonal matrix N indicating the noise variances associated with each observation along the diagonal.

(One might imagine a slightly more flexible model including a scaling parameter, replacing N with c²N and inferring c from data.)
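The substitution described above can be sketched in plain Python. This is an illustration under stated assumptions, not code from the book: a zero prior mean, a squared-exponential kernel with fixed hyperparameters, scalar inputs, and a hand-rolled linear solver so it runs without dependencies. In practice one would use NumPy or a GP library. All function and parameter names are made up here.

```python
import math


def rbf(a, b, lengthscale=1.0, signal_var=1.0):
    """Squared-exponential covariance between two scalar inputs."""
    return signal_var * math.exp(-0.5 * ((a - b) / lengthscale) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def gp_posterior(xs, ys, noise_vars, x_star, c=1.0):
    """Posterior mean and variance of f(x_star) with known per-point noise.

    The homoskedastic term sigma^2 * I is replaced by c^2 * N, where
    N = diag(noise_vars). c = 1 trusts the supplied variances as given.
    """
    n = len(xs)
    K = [[rbf(xs[i], xs[j]) for j in range(n)] for i in range(n)]
    for i in range(n):
        K[i][i] += c * c * noise_vars[i]  # add c^2 * N on the diagonal
    k_star = [rbf(x, x_star) for x in xs]
    alpha = solve(K, ys)       # (K + c^2 N)^{-1} y
    v = solve(K, k_star)       # (K + c^2 N)^{-1} k_star
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(x_star, x_star) - sum(k * w for k, w in zip(k_star, v))
    return mean, var


# Two conflicting observations at the same input, one far noisier than the other.
mean_het, var_het = gp_posterior([0.0, 0.0], [1.0, -1.0], [0.01, 1.0], 0.0)
mean_hom, _ = gp_posterior([0.0, 0.0], [1.0, -1.0], [0.01, 0.01], 0.0)
_, var_scaled = gp_posterior([0.0, 0.0], [1.0, -1.0], [0.01, 1.0], 0.0, c=2.0)
```

In this toy case the posterior mean is pulled toward the low-noise observation when the variances differ, and sits at zero when both observations are equally noisy. Increasing c inflates every noise variance by the same factor and widens the posterior.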