by josh-sematic 643 days ago
They are indeed similar, and OpenAI did indeed use RL at training time in a way that has not been done before, as does this approach. Yes, both also involve some additional inference-time generation, but the problem is that (at least as of now) you can't get standard LLMs to actually do well with extra inference-time generation unless you have a training process that uses RL to teach them to do so effectively. I'm working on a blog post, aimed at an HN-level audience, that explains more about this. Stay tuned!

For what it's worth, here's the post I was referring to: https://www.airtrain.ai/blog/how-openai-o1-changes-the-llm-t...

HN discussion here: https://news.ycombinator.com/item?id=41723384