| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by yorwba 424 days ago
	They write "We manually inspect CoT validity to ensure correct answers stem from valid reasoning, not lucky guesses." but the example answer they show at the end only gets the correct number due to two errors canceling out. The model calculates 195+367+562+900 and gets 1924 instead of 2024, and also turns -437 - 2*234 into -805 instead of -905, but in total 1924-805 = 2024-905 = 1119 and from there the remaining steps are correct again. It would be interesting to know how much of the sampling efficiency improvement from reinforcement learning is due to being better at basic arithmetic (something which could also be achieved by giving the model access to a calculator tool) and how much is due to choosing the correct approach for solving the problem more often.