| HN Mirror

Perhaps it does some similar grouping of content, but this more directly incentivizes longer-term grouping of tokens into abstract concepts. I agree that it's not obvious this would perform better than letting the model build its own structures for grouping tokens, but the proof is in the pudding: the technique led to improved results for a given model & training size. This newer approach gives the model the freedom to build its own breakpoints, but still bakes the idea into the algorithm itself.

What it means is a harder question. Perhaps transformers are simply an inefficient computational structure for this process? Perhaps a more flexible computational structure would integrate this step more efficiently? Perhaps transformers are efficient enough, but our learning/densifying isn't? Or perhaps it's such a core, powerful step that it might as well be built into the algo regardless? Much to learn.