by Scene_Cast2 199 days ago
I'm not sure how likely it is that an answer would fall outside of the top-p of 0.95 (used in the paper). A random number generator would also need an unreasonably high number of samples to get a correct answer. I think figures 17 and 18 are interesting for this discussion too; they show performance at various sampling temperatures. I think the point of the paper is that RL "sharpens" the distribution of non-RL nets, but it does not uncover any new reasoning paths: non-RL nets already had multiple decently high-probability paths to answering questions to begin with, and RL reuses a subset of those.
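To make the "sharpening" idea concrete, here is a toy sketch, not anything from the paper. It assumes samples are independent, so pass@k for one problem is 1 - (1 - p)^k, and the per-problem success probabilities below are made up purely for illustration.

```python
def pass_at_k(p: float, k: int) -> float:
    """Chance that at least one of k independent samples is correct,
    given a per-sample success probability p (idealized closed form)."""
    return 1.0 - (1.0 - p) ** k


def mean_pass_at_k(per_problem_p, k):
    """Average pass@k over a list of per-problem success probabilities."""
    return sum(pass_at_k(p, k) for p in per_problem_p) / len(per_problem_p)


# Hypothetical numbers: the "sharpened" model puts much more mass on the
# problems it already solved often and drops the low-probability ones to zero.
base = [0.20, 0.10, 0.05, 0.02]
sharpened = [0.90, 0.60, 0.00, 0.00]

for k in (1, 16, 256):
    print(k, round(mean_pass_at_k(base, k), 3), round(mean_pass_at_k(sharpened, k), 3))
```

With these invented numbers the sharpened model wins at k=1, while the base model overtakes it at large k because its small but nonzero probabilities eventually pay off.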

  > I think the point of the paper is that RL "sharpens" the distribution of non-RL nets, but it does not uncover any new reasoning paths
This is an implication of the results that's intuitive and likely to be correct, but isn't guaranteed to be correct. The results do show worse answer correctness for the RL models at large k. But answers and the reasoning strategies used to arrive at those answers are different things. It's impractical to inspect the CoTs of both the RL and Base models to show that all the reasoning strategies used by the former are a subset of the latter's. For all we know, the Venn diagram might not be fully overlapping. It could be that RL did uncover some novel and subtle reasoning strategies not present in the Base, but also introduced separate handicaps for some unknown reason, which nerfed answer correctness for large k. We need some theory to bridge that gap, and that seems to be lacking in the paper? Not that I fault them for the absence of such a theory, because it seems intractable. But then I am doubtful one could reach such a neat conclusion as they have tried to do, beyond the appeal to strong intuition (which I also share).
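A toy counterexample shows why the answer-level result alone doesn't force the subset conclusion. Everything here is hypothetical: the strategy names, which problems they solve, and the probability masses are invented, and samples are assumed independent.

```python
# Which (hypothetical) problems each (hypothetical) strategy solves.
SOLVES = {"A": {1}, "B": {2}, "C": {3}, "D": {4}}

# Probability mass each model puts on each strategy per sample.
# Leftover mass stands for attempts that solve nothing.
base = {"A": 0.2, "B": 0.1, "C": 0.1}
rl = {"A": 0.8, "D": 0.1}  # "D" is novel; "B" and "C" were lost


def pass_at_k(model, problem, k):
    """Chance that at least one of k independent samples solves the problem."""
    p = sum(mass for s, mass in model.items() if problem in SOLVES[s])
    return 1.0 - (1.0 - p) ** k


def mean_pass_at_k(model, k, problems=(1, 2, 3, 4)):
    """Average pass@k over the toy problem set."""
    return sum(pass_at_k(model, q, k) for q in problems) / len(problems)


# Strategies the RL model uses that the base model never does.
novel = set(rl) - set(base)

for k in (1, 1024):
    print(k, round(mean_pass_at_k(base, k), 3), round(mean_pass_at_k(rl, k), 3))
print("novel strategies:", novel)
```

In this made-up setup the RL model is better at k=1 and worse at large k, yet it still uses a strategy the base never does. So the pass@k curves by themselves are compatible with a non-subset picture.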
Ah, I think I agree. There could be a potential unrelated handicap, so there is a lack of a guarantee or a proof.