| HN Mirror



	by krackers 215 days ago
	Is there a summary? Every time I try to understand more about what LeCun is saying, all I see are straw men of LLMs (like claims that LLMs cannot learn a world model or that next token prediction is insufficient for long-range planning). There are lots of tweaks you can do to LLMs without fundamentally changing the architecture, e.g. looped latents, or adding additional models as preprocessors for input embeddings (in the way that image tokens are formed). I can buy that a pure next-token prediction inductive bias for training might turn out to be inefficient (e.g. there's clearly lots of information in the residual stream that's being thrown away), but it's not at all obvious a priori, to me as a layman at least, that the transformer architecture is a "dead end".

3 comments

ACCount37 215 days ago

That's the issue I have with criticism of LLMs.

A lot of people say "LLMs are fundamentally flawed, a dead end, and can never become AGI", but on deeper examination? The arguments are weak at best, and completely bogus at worst. And then the suggested alternatives fail to outperform the baseline.

I think by now, it's clear that pure next token prediction as a training objective is insufficient in practice (might be sufficient in the limit?) - which is why we see things like RLHF, RLAIF and RLVR in post-training instead of just SFT. But that says little about the limitations of next token prediction as an architecture.

Next token prediction as a training objective still allows an LLM to learn an awful lot of useful features and representations in an unsupervised fashion, so it's not going away any time soon. But I do expect to see modified pre-training, with other objectives alongside it, to start steering the models towards features that are useful for inference early on.


sbinnee 215 days ago

You don’t sound like a layman, knowing about looped latents and the rest :)


estebarb 215 days ago

The criticisms are not straw men; they are actually well grounded in math. Take, for instance, his promotion of energy-based models.

In a probability distribution model, the model is always forced to output a probability for each token in a set, and those probabilities must sum to 1, even if all the states are nonsense. In an energy-based model, the model can infer that a state makes no sense at all and can backtrack by itself.
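To make the contrast concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not how any real LLM or EBM is implemented: the scores are made up, "energy" is simply taken to be the negative score, and the `threshold` and the `all_implausible` helper are hypothetical names introduced here.

```python
import math

def softmax(scores):
    """Normalize raw scores into probabilities that always sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def energies(scores):
    """Assumption for this sketch: energy is the negative raw score."""
    return [-s for s in scores]

def all_implausible(scores, threshold):
    """Hypothetical check: True if even the best candidate has high energy."""
    return min(energies(scores)) > threshold

good = [5.0, 1.0, 0.5]     # one candidate is clearly plausible
bad = [-9.0, -9.5, -10.0]  # every candidate scores poorly

p_good = softmax(good)
p_bad = softmax(bad)

# Softmax still hands out a full unit of probability mass in the bad case,
# so the probabilities alone do not say that every option is poor.
print(sum(p_bad))

# The unnormalized energies keep the absolute scale, so a check like this
# can flag the bad case and could be used as a signal to backtrack.
print(all_implausible(good, threshold=5.0), all_implausible(bad, threshold=5.0))
```

The point of the sketch is only that normalization discards the absolute scale of the scores, while unnormalized energies retain it. What a model does with that signal (backtracking, resampling, and so on) is outside what this example shows.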

Notice that diffusion models, DINO, and other successful models are energy-based models, or end up being good proxies for the data density (density is a proxy for entropy ~ information).

Finally, all probability models can be thought of as energy-based, but not all EBMs output probability distributions.
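The standard way to write that relationship, as a sketch with symbols introduced here for illustration ($E$ for the energy function, $Z$ for the normalizing constant), is the Gibbs/Boltzmann form:

```latex
% Any distribution p can be read as an energy by taking E(x) = -\log p(x):
\[
  E(x) = -\log p(x)
  \quad\Longrightarrow\quad
  p(x) = \frac{e^{-E(x)}}{Z}, \qquad Z = \sum_{x'} e^{-E(x')} = 1 .
\]
% Going the other way, an arbitrary energy E defines a distribution only if
% the normalizer is finite (and it is usable in practice only if Z is tractable):
\[
  Z = \sum_{x'} e^{-E(x')} < \infty .
\]
```

So the first direction always works, while the second needs the normalizer $Z$ to exist, which is why an EBM can be used to score or compare states without ever producing probabilities.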

So, his argument is not against transformers or the architectures themselves, but more about the learned geometry.


ACCount37 213 days ago

I'm really fucking math dumb. Can you explain what the "well grounded" part is, for the mathematically challenged?

Because all I've seen from the "energy based" approach in practice is a lot of hype and not a lot of results. If it isn't applicable to LLMs, then what is it applicable to? Where does it give an advantage? Why would you want it?

I really, genuinely don't get that.
