by kcorbitt 524 days ago
Lots of folks working on open-source reasoning models trained with reinforcement learning right now. The best one atm appears to be Alibaba's 32B-parameter QwQ: https://qwenlm.github.io/blog/qwq-32b-preview/

I also recently wrote a blog post explaining how reinforcement fine-tuning works, which is likely at least part of the pipeline used to train o1: https://openpipe.ai/blog/openai-rft

1 comment

I don't know if I would call it "the best one" when it has "How many r in strawberry" as one of its example questions and, when tried, it arrives at the answer "two".