| I have actually attempted this recently. I took the small 10M-parameter Shakespeare language model used as an example in nanoGPT, swapped out gradient descent, and tested various black-box optimizers I could find in the literature. It takes 3 minutes to train the Shakespeare model with gradient descent. The black-box methods I have tested so far would likely take 30+ hours to train (I haven't tried to take them to the end yet). I've hit a wall where progress is very slow. The text generated at that stage has punctuation, and words are separated by spaces, but the words themselves are mostly nonsense. It almost feels like it learned that English is letters separated by spaces, and that you put exclamation marks or periods at the end, but not much more than that. There are some larger-scale CMA-ES variants I still want to test, but they don't have quality implementations. I've tried staring at pictures of gradients and weights from half-trained models to come up with ideas for how to get there with black-box optimization. I'm also trying some original ideas where you compute a gradient, but not against a loss function. The gradient would be more for discovering hidden structure in the weights, which you would then hand to some black-box optimizer as a guide (which I guess makes it not entirely black box. Gray box?) Possible? I mean, I guess, technically. Practical? No way, unless some major breakthrough happens. My current goal is just to produce a model, even if training takes laughably long, so I can say I've trained a language model using nothing but a fitness score from a black-box function. Edit: if you are reading this and are aware of any other serious attempts at training a non-trivially sized language model without gradient descent, I would want to know, so I can try their methods. I know there's some large-scale stuff used in reinforcement learning, like in one Uber paper, but not in LLMs specifically. |
https://www.publichealth.columbia.edu/research/population-he...
https://en.wikipedia.org/wiki/Kriging
Basically, you are wasting most of your compute to come up with a rough local approximation to the thing you actually want. But that's sort of pointless in the NN training context, because what you want is basically the gradient (and maybe some higher order terms that tell you about the local curvature too).
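To make the "rough local approximation" point concrete, here is a minimal sketch, not anyone's actual training setup. It assumes a toy quadratic loss whose exact gradient is known and estimates that gradient purely from loss values, using antithetic Gaussian perturbations (the estimator used by evolution-strategy style methods). The names `loss`, `es_gradient`, and `cosine` are made up for illustration.

```python
import math
import random

def loss(w):
    """Toy quadratic loss (an assumption for illustration); its exact gradient is w."""
    return 0.5 * sum(x * x for x in w)

def es_gradient(f, w, n_pairs, sigma, rng):
    """Estimate the gradient of f at w from function values only."""
    d = len(w)
    grad = [0.0] * d
    for _ in range(n_pairs):
        eps = [rng.gauss(0.0, 1.0) for _ in range(d)]
        w_plus = [wi + sigma * e for wi, e in zip(w, eps)]
        w_minus = [wi - sigma * e for wi, e in zip(w, eps)]
        # Two black-box evaluations per sampled direction.
        diff = (f(w_plus) - f(w_minus)) / (2.0 * sigma)
        for i in range(d):
            grad[i] += diff * eps[i] / n_pairs
    return grad

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

rng = random.Random(0)
w = [rng.uniform(-1.0, 1.0) for _ in range(50)]
true_grad = list(w)  # exact gradient of the toy loss

cos_few = cosine(es_gradient(loss, w, 5, 0.01, rng), true_grad)      # 10 evaluations
cos_many = cosine(es_gradient(loss, w, 2000, 0.01, rng), true_grad)  # 4000 evaluations
print("cosine with 10 evaluations:", cos_few)
print("cosine with 4000 evaluations:", cos_many)
```

Even with only 50 parameters, thousands of loss evaluations are needed before the estimate lines up closely with the true gradient, which backpropagation would give for roughly the cost of a couple of evaluations. The number of samples needed grows with the parameter count, which is the core of the problem at 10M parameters.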
CMA-ES makes sense when the gradient is not even well defined. For example, suppose you have a bunch of parameters for an airplane design, and you want to take that design and run a bunch of huge aerodynamics calculations to compute its lift, or a big finite element analysis to measure how well it withstands various stresses, and at the end of that analysis you get back a single number, like "maximum lift" or something. If each run takes hours on a supercomputer, then you clearly don't have anything close to a gradient, and it would be very expensive to even try to approximate it numerically. So CMA-ES is useful there in helping you pick better high-level parameters in a smart way. Basically, it's a big improvement over grid search.
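As a rough illustration of that use case, here is a deliberately simplified evolution strategy, not CMA-ES itself: it samples candidates around a mean, keeps the best few, and shrinks an isotropic step size on a fixed schedule, whereas real CMA-ES also adapts a full covariance matrix and its step size. `simulate_lift` is a hypothetical stand-in for the expensive simulation, and the parameter names and optimum are made up.

```python
import random

def simulate_lift(params):
    """Hypothetical stand-in for an expensive simulation: one score out, no gradient."""
    span, chord, angle = params
    return -((span - 3.0) ** 2 + (chord - 1.0) ** 2 + (angle - 0.2) ** 2)

def simple_es(fitness, mean, sigma, generations, pop_size, n_elite, rng):
    """Simplified (mu, lambda) evolution strategy with a fixed step-size decay."""
    for _ in range(generations):
        # Sample candidate designs around the current mean.
        population = [
            [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
            for _ in range(pop_size)
        ]
        # One black-box evaluation per candidate; higher fitness is better.
        population.sort(key=fitness, reverse=True)
        elite = population[:n_elite]
        # Move the mean to the average of the best candidates.
        mean = [sum(col) / n_elite for col in zip(*elite)]
        sigma *= 0.9  # assumed schedule; CMA-ES adapts this from the search path
    return mean

rng = random.Random(0)
best = simple_es(simulate_lift, [0.0, 0.0, 0.0], 1.0, 40, 20, 5, rng)
print("best parameters:", best)
print("score:", simulate_lift(best))
```

This spends 800 evaluations on three parameters. That is a fine trade when each parameter is a high-level design choice and no gradient exists, and a poor one when backpropagation is available.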