by xianshou 400 days ago

Calling it now - RL finally "just works" for any domain where answers are easily verifiable. Verifiability was always a prerequisite, but the difference from prior generations (not just AlphaGo, but any nontrivial RL process prior to roughly mid-2024) is that the reasoning traces and/or intermediate steps can be open-ended with potentially infinite branching, no clear notion of "steps" or nodes and edges in the game tree, and a wide range of equally valid solutions. As long as the quality of the end result can be evaluated cleanly, LLM-based RL is good to go.

As a corollary, once you add in self-play with random variation, the synthetic data problem is solved for coding, math, and some classes of scientific reasoning. No more mode collapse, no more massive teams of PhDs needed for human labeling, as long as you have a reliable metric for answer quality.

This isn't just neat, it's important - as we run out of useful human-generated data, RL scaling is the best candidate to take over where pretraining left off.

10 comments

resiros 400 days ago

Skimmed the paper quickly. This does not look like RL. It's a genetic algorithm. In a previous life I was working on compbio (protein structure prediction), and we built 100s of such heuristic-based algorithms (Monte Carlo simulated annealing, GA, ...). The moment you have a good energy function (one that provides some sort of gradient) and a fast enough sampling function (LLMs), you can do looots of cool optimization with sufficient compute.

I guess that's now becoming true with LLMs.

Faster LLMs -> More intelligence

UncleOxidant 400 days ago

> This does not look like RL. It's a genetic algorithm.

Couldn't you say that if you squint hard enough, GA looks like a category of RL? There are certainly a lot of similarities, the main difference being how each new population of solutions is generated. Would not be at all surprised if they're using a GA/RL hybrid.

vjerancrnjak 400 days ago

A genetic algorithm is worse than gradient descent.

If variety is sought, why not beam search with a nice population statistic?

moregrist 400 days ago

This depends quite a bit on what you’re trying to optimize.

Gradient descent is literally following the negative of the gradient to minimize a function. It requires a continuous domain, either analytical or numerical derivatives of the cost function, and has well-known issues in narrow valleys and other complex landscapes.

It’s also a local minimization technique and cannot escape local minima by itself.

_Stochastic_ gradient descent and related techniques can overcome some of these difficulties, but are still more or less local minimization techniques and require differentiable and continuous scoring functions.

In contrast, genetic algorithms try to find global minima, do not require differentiable scoring functions, and can operate on both continuous and discrete domains. They have their own disadvantages.

Different techniques for different problems. The field of numerical optimization is vast and ancient for a reason.
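A minimal sketch of that contrast, under stated assumptions: the objective `f` is a made-up 1-D function with one local and one global minimum, and the population method is a stripped-down mutation-and-selection search (no crossover), standing in for a full genetic algorithm. All names and hyperparameters here are illustrative.

```python
import random

def f(x):
    # Toy non-convex objective (an assumption for illustration):
    # local minimum near x = 1.13, global minimum near x = -1.30.
    return x**4 - 3 * x**2 + x

def grad_f(x):
    # Analytical derivative of f; gradient descent needs this, the search below does not.
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x0, lr=0.01, steps=2000):
    # Follow the negative gradient from a single starting point.
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

def evolutionary_search(pop_size=30, keep=10, generations=50, sigma=0.3, seed=0):
    # Population-based search that only ever calls f, never its derivative.
    rng = random.Random(seed)
    population = [rng.uniform(-3.0, 3.0) for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the lowest-cost individuals unchanged (elitism).
        survivors = sorted(population, key=f)[:keep]
        # Variation: each child is a survivor plus Gaussian noise.
        children = [rng.choice(survivors) + rng.gauss(0.0, sigma)
                    for _ in range(pop_size - keep)]
        population = survivors + children
    return min(population, key=f)

x_gd = gradient_descent(x0=2.0)
x_es = evolutionary_search()
print(f"gradient descent:    x = {x_gd:.3f}, f(x) = {f(x_gd):.3f}")
print(f"evolutionary search: x = {x_es:.3f}, f(x) = {f(x_es):.3f}")
```

Started at `x0 = 2.0`, gradient descent settles into the nearer, shallower basin, while the population covers both basins and selection favors the deeper one. On a smooth convex problem the comparison would favor gradient descent instead, which is the point about matching technique to problem.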

yorwba 400 days ago

You also need a base model that can satisfy the verifier at least some of the time. If all attempts fail, there's nothing there to reinforce. The reinforcement-learning algorithms themselves haven't changed much, but LLMs got good enough on many problems that RL could be applied. So for any given class of problem you still need enough human data to get initial performance better than random.
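A small numeric illustration of "nothing there to reinforce", assuming a policy-gradient-style update in which each sampled attempt is weighted by its reward minus the batch mean. The mean baseline and the function name `advantages` are assumptions for this sketch, not something stated in the thread.

```python
def advantages(rewards):
    # Weight for each sampled attempt: its reward relative to the batch mean.
    # Assumes a mean-baseline update; other baselines behave similarly here.
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

all_fail = [0.0, 0.0, 0.0, 0.0]   # verifier rejected every attempt
one_pass = [0.0, 0.0, 1.0, 0.0]   # verifier accepted one attempt

print(advantages(all_fail))  # every weight is zero, so the update is zero
print(advantages(one_pass))  # the passing attempt gets a positive weight
```

With every reward at zero, every weight is zero and the update carries no signal. A single success is enough to separate the attempts, which is why the base model has to pass the verifier at least occasionally.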

skybrian 400 days ago

There's no API or product yet, so it seems unlikely that they made it to a "just works" level of polish?

They are having some success in making it work internally. Maybe only the team that built it can get it to work? But it does seem promising.

unignorant 400 days ago

This technique doesn't actually use RL at all! There’s no policy-gradient training, value function, or self-play RL loop like in AlphaZero/AlphaTensor/AlphaDev.

As far as I can read, the weights of the LLM are not modified. They do some kind of candidate selection via evolutionary algorithms for the LLM prompt, which the LLM then remixes. This process then iterates like a typical evolutionary algorithm.
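The loop described above can be sketched roughly as follows. This is a toy under loud assumptions: `fake_llm` is a hypothetical stand-in for the model call (it splices two selected parents and changes one character), `evaluate` is a made-up automated scorer, and the string-matching task is purely illustrative. Nothing here is taken from the paper.

```python
import random
import string

TARGET = "verifiable"  # known only to the evaluator, never shown to the generator
ALPHABET = string.ascii_lowercase

def evaluate(candidate):
    # Automated scorer: number of positions that match the target.
    return sum(a == b for a, b in zip(candidate, TARGET))

def fake_llm(parents, rng):
    # Hypothetical stand-in for an LLM call: the selected parents play the role
    # of the prompt, and the "model" remixes them. No parameters are updated.
    a, b = rng.sample(parents, 2)
    cut = rng.randrange(1, len(a))
    child = list(a[:cut] + b[cut:])
    child[rng.randrange(len(child))] = rng.choice(ALPHABET)
    return "".join(child)

def evolve(generations=200, pop_size=20, keep=5, seed=0):
    rng = random.Random(seed)
    population = ["".join(rng.choice(ALPHABET) for _ in TARGET)
                  for _ in range(pop_size)]
    best_scores = []
    for _ in range(generations):
        # Candidate selection: the top scorers become the next "prompt".
        parents = sorted(population, key=evaluate, reverse=True)[:keep]
        best_scores.append(evaluate(parents[0]))
        # The generator proposes new candidates from the selected ones.
        children = [fake_llm(parents, rng) for _ in range(pop_size - keep)]
        population = parents + children
    return max(population, key=evaluate), best_scores

best, best_scores = evolve()
print(best, evaluate(best), "of", len(TARGET))
```

All improvement comes from selection over sampled candidates. The generator itself is fixed throughout, which is the distinction from weight-updating RL being drawn here.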

modeless 400 days ago

IMO RL can only solve "easy" problems. The reason RL works now is that unsupervised learning is a general recipe for transforming hard problems into easy ones. But it can't go all the way to solutions, you need RL on top for that. Yann LeCun's "cherry on top" analogy was right.

smattiso 400 days ago

Are there platforms that make such training more streamlined? Say I have some definition of success for a given problem and its data. How do I go about generating said RL model as fast and easily as possible?

vrm 400 days ago

We're working on an OSS industrial-grade version of this at TensorZero but there's a long way to go. I think the easiest out of the box solution today is probably OpenAI RFT but that's a partial solve with substantial vendor lock-in.

4b11b4 400 days ago

This isn't quite RL, right...? It's an evolutionary approach on specifically labeled sections of code optimizing towards a set of metrics defined by evaluation functions written by a human.

I suppose you could consider that last part (optimizing some metric) "RL".

However, it's missing a key concept of RL which is the exploration/exploitation tradeoff.

TechDebtDevin 400 days ago

Most things are verifiable, just not with code. I'm not particularly excited for a world where everything is predictable. This is coming from a guy who loves forecasting/prediction modeling too, but one thing I hate about prediction modeling, especially from a hobbyist standpoint, is data. It's very hard to get useful data. Investors will literally buy into hospital groups to get medical data, for example.

There are monopolies on the coolest sets of data in almost all industries. All the RL in the world won't do us any good if the companies hoarding the data only use it to forecast outcomes that will make them more money, rather than to work out what can be done to better society.

spyckie2 400 days ago

I think you mean the general class of algorithms that scale with compute time, RL being the chief example. But yes, I agree on that point.

obsolete_wagie 400 days ago

Yup. It's coming. Any verifiable human skill will be done by AI.
