Hacker News new | ask | show | jobs
by Sol- 939 days ago
Thanks for the links, very interesting.

Wonder how a "self-play" equivalent would look like for LLMs, since they have no easy criterion to evaluate how well they are doing like in Go (as mentioned in the videos).

5 comments

I expect self-consistency might be one useful reward function.

Of course in the real world, for a real intelligent system, reality is the feedback/reward system, but for an LLM limited to it's training set, with nothing to ground it, maybe this is the best you can do ...

The idea is essentially that you need to assume (but of course GI-GO) that most of the training data is factual/reasonable whether in terms of facts or logic, and therefore that anything you can deduce from the training data that is consistent with the majority of the training data should be held as similarly valid (and vice versa).

Of course this critically hinges on the quality of the training data in the first place. Maybe it would work best with differently tagged "tiers" of training data with different levels of presumed authority and reasonableness. Let the better data be used as a proxy for ground truth to "police" the lesser quality data.

Maybe I’m off mark here but it seems like video footage of real life would be a massively beneficial data set because it can watch these videos and predict what will happen one second into the future and then see if it was correct. And it can do this over millions of hours of footage and have billions of data points.
Yes - that would help, but only to limited degree if just part of training set.

1) Really need runtime prediction feedback, not just pretraining

2) Really need feedback on results of one's own (prediction-driven) actions (incl. speech), not just on passive "what will happen next" observations

In math specifically, one could easily imagine a reward signal from some automated theorem proving engine
Yeah. I went into some detail of how it might work here: https://news.ycombinator.com/item?id=38036986
One could generate arbitrarily many math problems, where the solution is known.
It seems plausible you could have the LLM side call upon its knowledge of known problems and answers to quiz the q-learning side.

While this would still rely on a knowledge base in the LLM, I would imagine it could simplify the effort required to train reinforcement learning models, while widening the domains it could apply to.

ChatGPT does have some feedback that can be used to evaluate, in the form of thumbs up/down buttons, which probably nobody uses, and positive/negative responses to its messages. People often say "thanks" or "perfect!" in responses, including very smart people who frequent here.
ChatGPT was trained (in an additional step to supervised learning of the base LLM) with reinforcement learning from human feedback (RLHF) where some contractors were presented with two LLM output to the same prompt and they had to decide, which one is better. This was a core ingredient to the performance of the system.
They could also look at the use of the regenerate button, which I do use often, and would serve the same purpose