by a1j9o94 511 days ago
Probably not the whole model, but the first step was "fine-tuning" the base model on ~800 chain-of-thought examples.

Those were probably from OpenAI models. Then they used reinforcement learning to expand the reasoning capabilities.

1 comment

800k, not 800. They say those examples came from earlier versions of their own models, with a lot of bad examples rejected. They don't seem to say which models produced the "thousands of cold-start" examples used earlier in the process, though.