by yobbo 1066 days ago
You could start by reading about CMA-ES, which is something like a particle filter over the model parameters. With 100 "particles", you have 100 resampled copies of the model; these are evaluated to produce something like a "synthetic" gradient, which is then used to update a distribution over the model parameters.

But it doesn't solve the problem of local minima, and it still needs to use minibatches.
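
To make the sample/evaluate/update loop concrete, here is a rough sketch. It is not real CMA-ES: there is no full covariance matrix and there are no evolution paths, just a diagonal Gaussian that gets refit to the best samples each round (closer to a cross-entropy-style evolution strategy). The names `loss`, `evolve`, `population` and `elite` are made up for illustration, and the quadratic `loss` is a stand-in for evaluating a model, which in practice would be done on a minibatch.

```python
import random


def loss(params):
    # Hypothetical stand-in for a model's loss: a quadratic bowl
    # whose minimum sits at (3, -2).
    target = [3.0, -2.0]
    return sum((p - t) ** 2 for p, t in zip(params, target))


def evolve(loss_fn, dim, population=100, elite=20, steps=60, seed=0):
    rng = random.Random(seed)
    mean = [0.0] * dim   # centre of the search distribution
    sigma = [1.0] * dim  # per-parameter standard deviation (diagonal only)
    for _ in range(steps):
        # Sample `population` candidate parameter vectors ("particles").
        samples = [
            [rng.gauss(m, s) for m, s in zip(mean, sigma)]
            for _ in range(population)
        ]
        # Evaluate every candidate and keep the `elite` best ones.
        samples.sort(key=loss_fn)
        best = samples[:elite]
        # Move the mean to the average of the elite samples.
        new_mean = [sum(x[i] for x in best) / elite for i in range(dim)]
        # Refit the spread around the OLD mean, so the step size grows
        # while the distribution is still travelling and shrinks near
        # an optimum. The floor keeps sigma strictly positive.
        sigma = [
            max(1e-8, (sum((x[i] - mean[i]) ** 2 for x in best) / elite) ** 0.5)
            for i in range(dim)
        ]
        mean = new_mean
    return mean


if __name__ == "__main__":
    print(evolve(loss, dim=2))
```

No gradient of `loss` is ever computed; the shift of the mean toward the elite samples plays the role of the "synthetic" gradient. On a loss with several basins, the same loop can settle into whichever basin the early samples favour.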