| HN Mirror

I always assumed the reason is that you are working with the pretrained model rather than against it. Whatever “logic” rules or functions the model came up with to compress (make more sense of) the vast amounts of pretraining data, it then uses the same functions during RL. Of course, distillation from a strong, huge model might still help more than RL directly applied on the small model because the strong model came up with much better functions/reasoning during pretraining, which the small model can simply copy. These models all learn in different ways than most humans, so human-based SFT can only go so far.