I see those too, and I think of it as the "thinking" in action. If you could replace the actual thinking trace with gibberish and get improved performance that scaled with the amount of gibberish injected, that's what we'd do. Instead, we see that the quality of the model's output scales with the number of "thinking" tokens it generates before responding.

In my experience, yes, models contradict themselves throughout their thinking process, but the conclusions they reach near the end of thinking align with the final output more often than not.