
by diggan 644 days ago

I think this submission paper is talking about reinforcement learning as part of/after the main training; the model then does inference as normal.

They might have done that for O1, but the bigger change is the "runtime train of thought" that once the model received the prompt and before giving a definitive answer, it "thinks" with words and readjusts at runtime.

At least that's my understanding of these two approaches, and if that's true, then they're not similar.

AFAIK, OpenAI has been doing reinforcement learning since the first version of ChatGPT and for every model since; that's why you can leave feedback in the UI in the first place.

4 comments

numeri 644 days ago

OpenAI stated [1] that one of the breakthroughs needed for o1's train of thought to work was reinforcement learning to teach it to recover from faulty reasoning.

> Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working.

That's incredibly similar to this paper, which discusses the difficulty in finding a training method that guides the model to learn a self-correcting technique (in which subsequent attempts learn from and improve on previous attempts), instead of just "collapsing" into a mode of trying to get the answer right with the very first try.

[1]: https://openai.com/index/learning-to-reason-with-llms/
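To make the "collapse" concern concrete, here is a minimal sketch of two toy reward functions for a two-attempt episode. Everything in it is an assumption for illustration: the function names, the `alpha` weight, and the idea of adding a bonus for improvement between attempts are hypothetical and are not taken from the paper or from OpenAI's post. It only shows why a reward on the final answer alone cannot tell "right on the first try" apart from "wrong, then corrected".

```python
def final_only_reward(first_correct: bool, second_correct: bool) -> float:
    """Reward that looks only at the final (second) attempt."""
    return 1.0 if second_correct else 0.0


def progress_shaped_reward(
    first_correct: bool, second_correct: bool, alpha: float = 0.5
) -> float:
    """Hypothetical shaped reward: final correctness plus a bonus
    (or penalty) for the change between the first and second attempt."""
    r1 = 1.0 if first_correct else 0.0
    r2 = 1.0 if second_correct else 0.0
    return r2 + alpha * (r2 - r1)


if __name__ == "__main__":
    # (first attempt correct?, second attempt correct?)
    for episode in [(True, True), (False, True), (True, False), (False, False)]:
        print(episode, final_only_reward(*episode), progress_shaped_reward(*episode))
```

With the final-only reward, the first two episodes score the same, so nothing pushes the model toward actually revising. The shaped variant separates them, although a bonus like this could also reward a deliberately bad first attempt, so treat it as an illustration of the problem rather than a fix.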

josh-sematic 643 days ago

They are indeed similar, and OpenAI did indeed use RL at training time in a way that has not been done before, as does this approach. Yes, both also involve some additional inference-time generation, but the problem is that (at least as of now) you can't get standard LLMs to actually do well with extra inference-time generation unless you have a training process that uses RL to teach them to do so effectively. I'm working on a blog post, aimed at HN-level audiences, that explains more about this. Stay tuned!

josh-sematic 631 days ago

For what it's worth, here's the post I was referring to: https://www.airtrain.ai/blog/how-openai-o1-changes-the-llm-t...

HN discussion here: https://news.ycombinator.com/item?id=41723384

nsagent 644 days ago

Both models generate an answer after multiple turns, where each turn has access to the outputs from a previous turn. Both refer to the chain of outputs as a trace.

Since OpenAI did not specify what exactly is in their reasoning trace, it's not clear what, if any, difference there is between the approaches. They could be vastly different, or they could be slight variations of each other. Without details from OpenAI, it's not currently possible to tell.

whimsicalism 643 days ago

you are describing the same thing?

sorry as a practitioner i’m having trouble understanding what point/distinction you are trying to make

myownpetard 643 days ago

These are two very different things.

One is talking about an improvement made by making control flow changes during inference (no weight updates).

The other is talking about using reinforcement learning to do weight updates during training to promote a particular type of response.

OpenAI had previously used reinforcement learning from human feedback (RLHF), which essentially relies on manual human scoring as its reward function and is therefore inherently slow and limited.

o1 and this paper talk about using techniques to create a useful reward function to use in RL that doesn't rely on human feedback.
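A toy sketch of the distinction, under loud assumptions: the "model" is just two logits over two actions, `verifier_reward` is a hypothetical programmatic reward standing in for whatever o1 or the paper actually use, and the update is plain REINFORCE. None of these names come from the paper or from OpenAI. `best_of_n` changes only inference-time control flow and leaves the weights alone; `reinforce_step` changes the weights.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def verifier_reward(action):
    """Hypothetical automatic reward (no human in the loop):
    action 1 stands for the desired type of response."""
    return 1.0 if action == 1 else 0.0


def sample(logits, rng):
    """Draw one action from the policy defined by the logits."""
    return rng.choices(range(len(logits)), weights=softmax(logits), k=1)[0]


def best_of_n(logits, n, rng):
    """Inference-time control flow: sample n candidates and keep the
    best-scoring one. The logits (weights) are never modified."""
    candidates = [sample(logits, rng) for _ in range(n)]
    return max(candidates, key=verifier_reward)


def reinforce_step(logits, lr, rng):
    """Training-time RL: one REINFORCE update. Returns new logits,
    using grad log pi(a) = onehot(a) - probs for a softmax policy."""
    probs = softmax(logits)
    action = sample(logits, rng)
    reward = verifier_reward(action)
    return [
        w + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (w, p) in enumerate(zip(logits, probs))
    ]


def train(logits, steps, lr, seed=0):
    """Run several REINFORCE updates and return the updated logits."""
    rng = random.Random(seed)
    for _ in range(steps):
        logits = reinforce_step(logits, lr, rng)
    return logits


if __name__ == "__main__":
    start = [0.0, 0.0]
    print("best-of-8 pick:", best_of_n(start, 8, random.Random(0)), "logits still", start)
    print("policy after RL:", softmax(train(start, steps=200, lr=0.5)))
```

After `best_of_n` the policy is exactly what it was before the call; after `train` the policy itself has shifted toward the rewarded action, which is the part that requires a usable reward function.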

whimsicalism 643 days ago

No?

> I think this submission paper is talking about reinforcement learning as part of/after the main training

Reinforcement learning to promote a particular type of self-correction response

> They might have done that for O1, but the bigger change is the "runtime train of thought" that once the model received the prompt and before giving a definitive answer,

Also reinforcement learning to promote a certain reasoning trace

> o1 and this paper talk about using techniques to create a useful reward function to use in RL that doesn't rely on human feedback.

Exactly -> the same thing

myownpetard 642 days ago

> as part of/after the main training

I take this to mean during weight updates, i.e. training.

> "runtime train of thought"

I take runtime here to mean inference, not during RL. What does runtime mean to you?

Previous approaches [0] successfully used inference-time chain of thought to improve model responses. That has nothing to do with RL though.

The grandparent is wrong about the paper. They are doing chain of thought responses during training and doing RL on that to update the weights, not just during inference/runtime.

[0] https://arxiv.org/abs/2201.11903
