| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by PontifexCipher 545 days ago

Some info that may be missing:

- v2/v3 (not r1) seem to be cloned from o1/4o output, and perform worse (this cost the oft-repeated 5ish mm USD)

- r1 is specifically a reasoning step (using RL) _on top of_ v2/v3 and performs similarly to o1 (the cost of this is _not reported anywhere_)

- In the o1 blog post, they specifically say they use RL to add reasoning to LLMs: https://openai.com/index/learning-to-reason-with-llms/

1 comments

sudosysgen 545 days ago

The R1-Zero paper shows how many training steps the RL took, and it's not many. The cost of the RL is likely a small fraction of the cost of the foundational model.

link