|
|
|
|
|
by ANighRaisin
491 days ago
|
|
I would say that diversity isn't something that's easy to reenforce, but I do think it will occur as a natural consequence of optimizing for shorter chains of thought according to a wide variety of problems. Of course, the nature of the data may lead it to do brute force, but that can be fixed with clever fine tuning. |
|
For diversity reward, my thinking is basically looking at reasoning tokens in latent space - taking semantic similarity between subsequent chains, and if they are extremely similar, penalizing it.