I'm gonna read this paper and the other latent sentence paper later today. I've always argued that this kind of solution, together with latent sentence search, should get us to the next level of AI. Amazing work from Meta.
Perhaps it does some similar grouping of content, but this more directly incentivizes longer-term grouping of tokens into abstract concepts. I agree that it's not obvious this would perform better than letting the model build its own structures for grouping tokens, but the proof is in the pudding; the technique led to improved results for a given model & training size. This newer approach gives the model the freedom to build its own breakpoints, but still bakes the idea into the algorithm itself.
What it means is a harder question. Perhaps transformers are simply an inefficient computational structure for this process? Perhaps a more flexible computational structure would integrate this step more efficiently? Perhaps transformers are efficient enough, but our learning/densifying isn't? Or perhaps it's such a core, powerful step that it might as well be built into the algo regardless? Much to learn.
I don’t get it. Isn’t this concept modelling exactly what’s going on in the deeper layers of current LLMs?