Hacker News new | ask | show | jobs
by optimalsolver 1069 days ago
Is it possible to use derivative-free/black-box optimizers to train these large networks?

From what I understand, gradient descent and its cousins can't suddenly jump to a distant global optimum.

8 comments

I have actually attempted this recently. I took a small 10M parameter Shakespeare language model used as an example in nanoGPT, swapped out gradient descent, tested various black-box optimizers from what I could find in literature.

It takes 3 minutes to train the Shakespeare model with gradient descent. The black-box methods I tested so far likely take 30+ hours to train (I haven't tried to take them to the end yet). I've hit a wall where progress is very slow. The text generated at that stage has punctuation and words are split with spaces but the words themselves are mostly nonsense. Almost feels like it learned that English is letters separated by spaces, and that you put exclamation marks or periods at the end but not that much more.

There's some larger scale CMA-ES variants I still want to test that don't have quality implementations. I've tried to stare at pictures of gradients and weights from half-trained models and trying to come up with ideas how to get there with black-box optimization. Also trying some original ideas where you compute a gradient, but you would not compute it against a loss function. The gradient would be more for discovering hidden structure in weights, that you would then put on some black-box optimizer as a guide (which I guess makes it not entirely black box. Gray box?)

Possible? I mean, I guess technically. Practical? No way, unless some major breakthrough happens.

My current goal is to just produce a model, even if training takes laughably long so I can say I've trained a language model using nothing but getting a fitness score from a black box function.

Edit: if you are reading this and are aware of any other serious attempts at training a non-trivial sized language model without gradient descent I would want to know. So I can try their methods. I know there's some large scale stuff used in reinforcement learning like in one Uber paper but not in LLMs specifically.

Not really surprising-- CMAES basically replaces the actual gradient you care about with a rough numerical approximation to it that's based on looking at lots of input-output pairs. I think the concept originated in surveying, where it's called the technique of "kriging":

https://www.publichealth.columbia.edu/research/population-he...

https://en.wikipedia.org/wiki/Kriging

Basically, you are wasting most of your compute to come up with a rough local approximation to the thing you actually want. But that's sort of pointless in the NN training context, because what you want is basically the gradient (and maybe some higher order terms that tell you about the local curvature too).

CMAES makes sense when the gradient is not even well defined. For example, if you have a bunch of parameters for an airplane design, and then want to take that design and do a bunch of huge aerodynamics calculations to compute its lift, or do a big finite element analysis to measure how well it withstands various stresses, and at the end of that big analysis, you get back a number, like "maximum lift" or something. If each run takes hours on a supercomputer, then you clearly don't have anything close to a gradient and it would be very expensive to even try to approximate it numerically. So CMAES is useful there in helping you pick better high level parameters in a smart way-- basically it's a big improvement over grid search.

I think I saw a paper that argued that CMA-ES is making an approximation to the natural gradient, which is not the same gradient you see in typical NN trainings. Or at least so I understood it. (I have no background in data science or ML, I'm just a bored engineer)

I haven't estimated the number of trials you would need for 10M Shakespeare model but I think to get to the same level as gradient descent, it might be around 10M, i.e. same ballpark as the number of parameters. Which makes some intuitive sense because of how little you learn from each black box function evaluation.

There's maybe some hope that there is a hidden structure in the problem that does not actually need anywhere close to 10M parameters so that a black box optimizer might still find it. I don't have my hopes up though but I'm trying to poke at it.

I would think that if it turns out LLMs are not totally impossible with black box optimization, then it would be good to find a reason to use it. E.g. objective functions that don't have a hope of having a good gradient. Some kind of weird architecture that can't be trained conventionally. Maybe fine-tuning gradient descent optimized models with those things. Etc. Feels like a solution looking for a problem.

I'm doing my project for fun and just seeing if it's possible at all.

Have you tried Hinton's forward-forward method?
I have not. I have read the paper though. I do want to try it and likely will at some point.

Next up after this project is that I want to test some metalearning ideas. I read some papers where the idea is that all weights are actually tinier neural networks, all with the same parameters where you train it to learn backpropagation (or whatever learning algorithm it converges to). The paper I read this from argued it also worked for forward-only but my intuition doesn't quite understand how. I want to follow up a bit on this line of research and check if there's been any new developments since I read them and try them out in my own code.

Very interesting, thanks for sharing. I would be interested in reading more on gradient-free optimization applied to large problems (like LLMs).
No one has the scale to make that happen.

It's about information. Gradient-free methods integrate little or no information about the problem; they're a blind watchmaker. This works, but it's slow and gets slower the bigger your problem is. (curse of dimensionality)

Gradients integrate some limited information about the problem. This lets you find solutions much faster, and neural networks are structured specifically to be easy to optimize with gradients. Local minima don't seem to be a problem.

The future is probably even smarter optimizers that integrate more information about the problem and learn to make good assumptions. This is the goal of Learned Optimizers, like Velo (https://arxiv.org/abs/2211.09760).

Thank you for posting Velo optimizer - very interesting work.
You could start reading on CMA-ES; which is something like a particle filter on the model parameters. So for 100 "particles", it means 100 resampled copies of the model, which are then evaluated to create something like a "synthetic" gradient which is used to update a distribution over the model parameters.

But it doesn't solve the problem of local minima, and it will also need to use minibatches.

People have tried, but all the local minimums perform about the same so there's no point trying to find a global minimum. A much better strategy is to train multiple models and use all of them at inference time to score better
You have a gradient so use it instead of faffing about. As another user said, the optima are all the same since the models are wildly over-parameterized.
This is tersely stated, but it's wise. In general, following the gradient (if you have it) is a very, very good idea.
First order optimizers have trouble because they fall into local minima, however, in practice things are different.

When your parameter space is in the order of billions, for all practical purposes, there is always a direction of descent.

More over, local minima seem to be rather close to the global minima.

I believe there are still ongoing efforts in this. See for example the forward forward algorithm.

https://www.google.com/url?sa=t&source=web&rct=j&opi=8997844...

particle swarm optimization, genetic algorithms, and tabu search/heuristic search are some items I'm aware of to force out of local optimum. Using Halton sequences can also help cover the space for search initialization, versus simple random draws in a space.
From my granted limited understanding Adam is basically gradient decent combined with heuristic search.
Adam is somewhat analogous to an audio compressor on the gradient "signals".

(Edit: eh sort of ...)

Yes, and the only reason it doesn't work is that no one has written truly fast, GPU implementations of them. Don't let anyone here teach you otherwise, even small scale crappy versions (like what I could code in numpy) can successfully solve reinforcement learning problems rather quickly. Nay-sayers might tell you that it doesn't work, but they are wrong. Global optimization is strictly superior to local optimization in general, and we in the AI field are stuck deep in a local minimum right now.

Here's me implementing an algorithm from 2009 in single-core on a CPU and getting pretty excellent results on RLHF benchmarks: https://github.com/Hellisotherpeople/Python-Cooperative-Syna...

The whole argument is that at large scales with billions of params it doesn’t matter specifically because of those billions so giving a toy example seems to miss the point.
No.
More correctly: we don't know how to do this efficiently. Biological neural networks don't use backpropagation and work great.
Conference takes years, though... It's entirely possible that back prop is unrealistic but far more efficient than biological learning.
Out of my depth so happy to be corrected.

Don’t many/most state of the art models take many months to train on far more data than humans need for similar tasks?

Also, while e.g. GPT4 is quite capable across many tasks - humans seem to average towards learning robust _learning techniques_ themselves. Learning a new subject becomes easier thanks to somehow tracking and encoding learning strategies that are robust to learning other unrelated topics.

> Don’t many/most state of the art models take many months to train on far more data than humans need for similar tasks?

Humans generally need 18 years of pre training followed by 4-6 years of fine tuning before they can “one-shot” many difficult tasks. That’s way more training than any machine learning model I’m aware of.

Even for tasks like reading the newspaper and summarizing what you read, you probably had to train for 10-12 years.

I see this stance of yours parroted over and over but a 3 year old can tell a dog from a cat doesn't need to be trained on millions of images. Also uses way less energy for that.