Hacker News new | ask | show | jobs
by armcat 490 days ago
It feels like lot of the reasoning tokens go to waste on pure brute force approach - plugging in numbers and evaluating and comparing against the answer. "Nope, that didn't work, let's try 4 instead of 6 this time", etc. What if the reward function instead focuses on diversity of procedures within a token budged (10k - 20k tokens). I.e. RL rewards the model in trying different methods or generating different hypotheses, rather than brute forcing its way through, and potentially getting stuck in loops.
1 comments

I would say that diversity isn't something that's easy to reenforce, but I do think it will occur as a natural consequence of optimizing for shorter chains of thought according to a wide variety of problems. Of course, the nature of the data may lead it to do brute force, but that can be fixed with clever fine tuning.
I am not too sure about shortening the CoT tokens explicitly because different problems will require different length of proof - some require half a page, whilst others will require 10 pages worth of tokens. As the graphs in the paper indicate, there is a huge penalty on short reasoning lengths, below a few thousand tokens.

For diversity reward, my thinking is basically looking at reasoning tokens in latent space - taking semantic similarity between subsequent chains, and if they are extremely similar, penalizing it.