HN Mirror


	by a1j9o94 511 days ago
	Probably not the whole model, but the first step was "fine-tuning" the base model on ~800 chain-of-thought examples. Those were probably from OpenAI models. Then they used reinforcement learning to expand the reasoning capabilities.

1 comment

mkl 511 days ago

800k. They say the examples came from earlier versions of their own models, with a lot of bad examples rejected. They don't seem to say which models the "thousands of cold-start" examples used earlier in the process came from, though.
