
by JacobJeppesen 934 days ago

Seems like they have made progress in combining reinforcement learning and LLMs. Andrej Karpathy mentions it in his new talk (~38 minutes in) [1], and Ilya Sutskever talks about it in a lecture at MIT (~29 minutes in) [2]. It would be a huge breakthrough to find a proper reward function to train LLMs in a reinforcement learning setup, and to train a model to solve math problems in a similar fashion to how AlphaGo used self-play to learn Go.

[1] https://www.youtube.com/watch?v=zjkBMFhNj_g&t=2282s

[2] https://www.youtube.com/watch?v=9EN_HoEk3KY&t=1705s

4 comments

jug 934 days ago

Q* may also be a reference to the well-known A* search algorithm, with the Q referring to Q-learning, which further backs the reinforcement learning theory. https://en.wikipedia.org/wiki/Q-learning
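For readers unfamiliar with Q-learning, here is a minimal tabular sketch of the update rule. The chain environment, the function name, and all hyperparameters are made up for illustration; nothing here is claimed to describe what Q* actually is.

```python
import random

def q_learning_chain(n_states=5, episodes=500, alpha=0.5, gamma=0.9,
                     epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical 1-D chain.

    The agent starts at state 0 and gets reward 1.0 on reaching the last
    state. Actions: 0 = move left, 1 = move right.
    """
    rng = random.Random(seed)
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != goal:
            # Epsilon-greedy choice; ties are broken at random.
            if rng.random() < epsilon or q[s][0] == q[s][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[s][0] > q[s][1] else 1
            s_next = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s_next == goal else 0.0
            # The goal is terminal, so it contributes no future value.
            future = 0.0 if s_next == goal else gamma * max(q[s_next])
            # Q-learning update: move Q(s, a) toward r + gamma * max Q(s', .)
            q[s][a] += alpha * (r + future - q[s][a])
            s = s_next
    return q
```

After training, the greedy policy (pick the action with the larger Q-value in each state) walks straight to the goal.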

Sol- 934 days ago

Thanks for the links, very interesting.

I wonder what a "self-play" equivalent would look like for LLMs, since they have no easy criterion for evaluating how well they are doing, unlike in Go (as mentioned in the videos).

HarHarVeryFunny 934 days ago

I expect self-consistency might be one useful reward function.

Of course in the real world, for a real intelligent system, reality is the feedback/reward system, but for an LLM limited to its training set, with nothing to ground it, maybe this is the best you can do ...

The idea is essentially that you need to assume (subject to GIGO, of course) that most of the training data is reasonable, whether in terms of facts or logic, and therefore that anything you can deduce from the training data that is consistent with the majority of the training data should be held as similarly valid (and vice versa).

Of course this critically hinges on the quality of the training data in the first place. Maybe it would work best with differently tagged "tiers" of training data with different levels of presumed authority and reasonableness. Let the better data be used as a proxy for ground truth to "police" the lesser quality data.
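A rough sketch of what a self-consistency reward could look like: sample several answers to the same prompt and reward each one by how much of the (optionally weighted) sample agrees with it. The function names and the `weights` parameter (a stand-in for the "tiers" of authority) are hypothetical, not taken from any real system.

```python
from collections import defaultdict

def self_consistency_rewards(answers, weights=None):
    """Reward each answer by the weighted share of samples that agree with it.

    `weights` is a hypothetical per-sample authority weight (e.g. higher
    for answers derived from a more trusted data tier); defaults to 1.0.
    """
    if weights is None:
        weights = [1.0] * len(answers)
    mass = defaultdict(float)
    for answer, weight in zip(answers, weights):
        mass[answer] += weight
    total = sum(weights)
    return [mass[answer] / total for answer in answers]

def majority_answer(answers, weights=None):
    """Return the answer with the largest weighted agreement."""
    rewards = self_consistency_rewards(answers, weights)
    best_index = max(range(len(answers)), key=lambda i: rewards[i])
    return answers[best_index]
```

For example, with samples `["42", "42", "41", "42"]` the three agreeing answers each get reward 0.75 and the outlier gets 0.25. As noted above, this only rewards agreement, not truth, so it inherits whatever errors the majority makes.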

93po 934 days ago

Maybe I’m off the mark here, but it seems like video footage of real life would be a massively beneficial data set, because the model can watch these videos, predict what will happen one second into the future, and then see if it was correct. And it can do this over millions of hours of footage and have billions of data points.

HarHarVeryFunny 934 days ago

Yes - that would help, but only to a limited degree if it's just part of the training set.

1) Really need runtime prediction feedback, not just pretraining

2) Really need feedback on results of one's own (prediction-driven) actions (incl. speech), not just on passive "what will happen next" observations

jhrmnn 934 days ago

In math specifically, one could easily imagine a reward signal from some automated theorem proving engine.

cubefox 931 days ago

Yeah. I went into some detail of how it might work here: https://news.ycombinator.com/item?id=38036986

manx 934 days ago

One could generate arbitrarily many math problems where the solution is known.
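As a toy illustration (simple integer arithmetic only; the function names and prompt format are invented for this sketch), a generator plus an exact-match reward might look like this:

```python
import random

def make_problem(rng):
    """Generate one arithmetic question together with its known answer."""
    a, b = rng.randint(1, 99), rng.randint(1, 99)
    op = rng.choice(["+", "-", "*"])
    answer = {"+": a + b, "-": a - b, "*": a * b}[op]
    return f"What is {a} {op} {b}?", answer

def reward(model_output, answer):
    """Return 1.0 if the output parses to the known answer, else 0.0."""
    try:
        return 1.0 if int(model_output.strip()) == answer else 0.0
    except ValueError:
        # Unparseable output earns no reward.
        return 0.0
```

Because the answer is computed when the problem is generated, the reward needs no human labeling. Harder problems would need a real checker rather than an integer comparison.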

lixy 934 days ago

It seems plausible you could have the LLM side call upon its knowledge of known problems and answers to quiz the Q-learning side.

While this would still rely on a knowledge base in the LLM, I would imagine it could simplify the effort required to train reinforcement learning models, while widening the domains it could apply to.

walthamstow 934 days ago

ChatGPT does have some feedback that can be used to evaluate, in the form of thumbs up/down buttons, which probably nobody uses, and positive/negative responses to its messages. People often say "thanks" or "perfect!" in responses, including very smart people who frequent here.

lagrange77 934 days ago

ChatGPT was trained (in an additional step after supervised learning of the base LLM) with reinforcement learning from human feedback (RLHF), where contractors were presented with two LLM outputs for the same prompt and had to decide which one was better. This was a core ingredient in the performance of the system.
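To make the pairwise comparison concrete: one common way to turn "A is better than B" labels into a training signal is a Bradley-Terry style loss on a reward model's scores. This is a sketch of that general technique, not a claim about the exact loss OpenAI used; the function name and the scalar scores are illustrative.

```python
import math

def preference_loss(score_chosen, score_rejected):
    """Pairwise loss: -log(sigmoid(score_chosen - score_rejected)).

    The scores are hypothetical scalar outputs of a reward model for the
    output the labeler preferred and the one they rejected.
    """
    diff = score_chosen - score_rejected
    # Equivalent to log(1 + exp(-diff)), written to avoid overflow.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
```

The loss is small when the reward model already scores the preferred output higher and grows when it ranks the pair the wrong way round; the trained reward model then supplies the reward for the reinforcement learning step.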

93po 934 days ago

They could also look at the use of the regenerate button, which I do use often and which would serve the same purpose.

ChatGTP 934 days ago

The veil of ignorance has been pushed back and the frontier of discovery forward

jansan 934 days ago

Well, you could post a vast number of comments to social media and see if and how others react to them. It's still humans doing the work, but they would not even know.

If this was actually done (and this is just wild baseless speculation), this would be a good reason to let Sam go.

93po 934 days ago

I see a lot of comments on reddit these days that are very clearly written by language models, so it’s probably already happening on a large scale.

AlexAndScripts 934 days ago

Have you got an example you could show? I'm curious