	by armcat 537 days ago
	It feels like a lot of the reasoning tokens go to waste on a pure brute-force approach: plugging in numbers, evaluating, and comparing against the answer. "Nope, that didn't work, let's try 4 instead of 6 this time", etc. What if the reward function instead focused on diversity of procedures within a token budget (10k - 20k tokens)? I.e. RL rewards the model for trying different methods or generating different hypotheses, rather than brute-forcing its way through and potentially getting stuck in loops.

1 comments

ANighRaisin 537 days ago

I would say that diversity isn't something that's easy to reinforce, but I do think it will occur as a natural consequence of optimizing for shorter chains of thought across a wide variety of problems. Of course, the nature of the data may lead it to brute force, but that can be fixed with clever fine-tuning.

armcat 537 days ago

I am not too sure about shortening the CoT tokens explicitly, because different problems will require different lengths of proof: some require half a page, whilst others will require 10 pages' worth of tokens. As the graphs in the paper indicate, there is a huge penalty on short reasoning lengths, below a few thousand tokens.

For the diversity reward, my thinking is basically to look at the reasoning tokens in latent space: take the semantic similarity between subsequent chains, and if they are extremely similar, penalize them.
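
A rough sketch of that similarity penalty, to make the idea concrete. Everything here is an assumption for illustration: `embed` is a toy bag-of-words stand-in for whatever latent-space encoder would really be used, and the `threshold` and `weight` values are arbitrary, not taken from the paper.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a real sentence encoder: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def diversity_penalty(chains, threshold=0.8, weight=1.0):
    """Penalize each consecutive pair of chains that is more similar than `threshold`.

    `threshold` and `weight` are hypothetical knobs chosen for illustration.
    """
    vectors = [embed(c) for c in chains]
    penalty = 0.0
    for prev, curr in zip(vectors, vectors[1:]):
        sim = cosine(prev, curr)
        if sim > threshold:
            # Only the similarity in excess of the threshold is penalized.
            penalty += weight * (sim - threshold)
    return penalty


# Two near-identical "plug in another number" attempts vs. a change of method.
brute_force = ["try x = 4 and check the answer", "try x = 6 and check the answer"]
diverse = ["try x = 4 and check the answer", "factor the polynomial and solve for roots"]
```

With this toy encoder, `diversity_penalty(brute_force)` comes out positive while `diversity_penalty(diverse)` is zero; the penalty would then be subtracted from the task reward. A real version would need a proper embedding model and some way to split the CoT into separate attempts.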