by whimsicalism 643 days ago
you are describing the same thing?

sorry, as a practitioner i’m having trouble understanding what distinction you are trying to make

These are two very different things.

One is talking about an improvement made by changing control flow during inference (no weight updates).

The other is talking about using reinforcement learning to update the weights during training, to promote a particular type of response.
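
To make the distinction concrete, here is a toy sketch (not taken from o1 or the paper). The "model" is just a list of three logits, and `reward`, `best_of_n`, and `reinforce_step` are hypothetical names invented for illustration. `best_of_n` changes only the control flow at inference time and leaves the logits alone; `reinforce_step` returns updated logits, which is what training does.

```python
import math
import random


def softmax(logits):
    """Turn logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample(logits, rng):
    """Sample one action index from the softmax policy."""
    return rng.choices(range(len(logits)), weights=softmax(logits), k=1)[0]


def reward(action):
    # Assumption for the toy: action 2 stands in for the "good" response.
    return 1.0 if action == 2 else 0.0


def best_of_n(logits, n, rng):
    """Inference-time control flow: sample n candidates, keep the best.

    The logits (the "weights") are read but never modified.
    """
    candidates = [sample(logits, rng) for _ in range(n)]
    return max(candidates, key=reward)


def reinforce_step(logits, rng, lr=0.5):
    """Training-time update: one REINFORCE step on the logits.

    Moves each logit along reward * d(log pi(action)) / d(logit).
    """
    probs = softmax(logits)
    action = sample(logits, rng)
    r = reward(action)
    return [
        w + lr * r * ((1.0 if i == action else 0.0) - probs[i])
        for i, w in enumerate(logits)
    ]
```

After many calls to `reinforce_step` the policy itself prefers action 2, so a single sample is usually good. With `best_of_n` the policy is unchanged and you pay for the extra samples on every request.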

OpenAI had previously used reinforcement learning from human feedback (RLHF), which essentially relies on manual human scoring as its reward function and is therefore inherently slow and limited.

o1 and this paper talk about using techniques to create a useful reward function to use in RL that doesn't rely on human feedback.
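
As an illustration of what a reward function without human scoring can look like, here is a minimal sketch of a programmatic reward that checks a final answer against a known ground truth. This is an assumed example, not a description of what o1 or the paper actually uses; the `Answer: <number>` convention and the function names are hypothetical.

```python
import re


def extract_final_answer(response):
    """Pull the final numeric answer out of a response.

    Assumed convention: the response ends with "Answer: <number>".
    Returns the number as a string, or None if the pattern is absent.
    """
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)\s*$", response.strip())
    return match.group(1) if match else None


def programmatic_reward(response, ground_truth):
    """Return 1.0 if the extracted answer equals the ground truth, else 0.0.

    No human is in the loop: the score comes from a mechanical check.
    """
    answer = extract_final_answer(response)
    if answer is None:
        return 0.0
    return 1.0 if float(answer) == float(ground_truth) else 0.0
```

A check like this can score as many sampled responses as you can generate, which is the practical difference from waiting on human raters.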

No?

> I think this submission paper is talking about reinforcement learning as part of/after the main training

Reinforcement learning to promote a particular type of self-correction response

> They might have done that for O1, but the bigger change is the "runtime train of thought" that once the model received the prompt and before giving a definitive answer,

Also reinforcement learning, to promote a certain kind of reasoning trace

> o1 and this paper talk about using techniques to create a useful reward function to use in RL that doesn't rely on human feedback.

Exactly -> the same thing

> as part of/after the main training

I take this to mean during weight updates, i.e. training.

> "runtime train of thought"

I take runtime here to mean inference, not during RL. What does runtime mean to you?

Previous approaches [0] successfully used inference-time chain of thought to improve model responses. That has nothing to do with RL though.
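
For reference, inference-time chain of thought in the prompting sense is roughly just prompt construction: you show the model worked examples with their reasoning and ask a new question. A minimal sketch, where `build_cot_prompt` and the dict keys are hypothetical and the call to an actual model is left out:

```python
def build_cot_prompt(examples, question):
    """Build a few-shot prompt whose examples include worked reasoning.

    Each example is a dict with "question", "reasoning", and "answer" keys
    (an assumed shape for this sketch). Nothing here touches model weights.
    """
    parts = []
    for ex in examples:
        parts.append(
            f"Q: {ex['question']}\n"
            f"A: {ex['reasoning']} The answer is {ex['answer']}."
        )
    # Leave the final answer open so the model continues with its own reasoning.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)
```

The model that receives this prompt is the same frozen model that would have received the bare question; only the input changed.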

The grandparent is wrong about the paper. They are generating chain-of-thought responses during training and doing RL on those to update the weights, not just using chain of thought at inference/runtime.

[0] https://arxiv.org/abs/2201.11903