| HN Mirror

by nullc 455 days ago

Consider the entropy of the distribution of token X in these examples:

"Four X"

and

"Four X and seven years ago".

In the first case X could be pretty much anything, but in the second case we both know the only likely completion.
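
A rough way to put numbers on that, as a sketch only: the phrases and counts below are invented, and `distribution_of_x` and `entropy_bits` are hypothetical helper names, not anything from a real model. It just compares the entropy of X given "Four" alone against the entropy of X given "Four" plus the known tail.

```python
import math
from collections import Counter

# Hypothetical toy "corpus" of (phrase, count). The counts are made up.
CORPUS = [
    (("Four", "score", "and", "seven", "years", "ago"), 5),
    (("Four", "seasons", "in", "one", "day"), 3),
    (("Four", "weddings", "and", "a", "funeral"), 4),
    (("Four", "legs", "good", "two", "legs", "bad"), 4),
]

def distribution_of_x(suffix=()):
    """Distribution of the token after "Four", optionally given the tokens that follow X."""
    counts = Counter()
    for tokens, n in CORPUS:
        if tokens[0] == "Four" and tokens[2:2 + len(suffix)] == tuple(suffix):
            counts[tokens[1]] += n
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def entropy_bits(dist):
    """Shannon entropy in bits."""
    return sum(p * math.log2(1 / p) for p in dist.values() if p > 0)

left_only = distribution_of_x()                                    # "Four X"
both_sides = distribution_of_x(("and", "seven", "years", "ago"))   # "Four X and seven years ago"

print(left_only, entropy_bits(left_only))
print(both_sides, entropy_bits(both_sides))
```

With only the left context, X is spread over several candidates; once the tail is known, the toy distribution collapses to a single token and the entropy goes to zero.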

So it seems like there would be a huge advantage in not having to run autoregressively. But in practice it's less significant than you might imagine, because the AR model can internally model the probability of X conditioned on the stuff it hasn't output yet. In fact, because training without reinforcement causes it to converge on the target probability of the whole output, the AR model must do some form of lookahead internally.
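
Spelled out, under the idealized assumption that the model matches the training distribution exactly (the notation here is mine, with x_{<t} for the tokens already output and x_{>t} for the ones still to come): the chain rule factors the whole-output probability into per-token conditionals, and each conditional is a marginal over every possible continuation, which is the "lookahead".

```latex
p(x_1, \dots, x_n) = \prod_{t=1}^{n} p(x_t \mid x_{<t}),
\qquad
p(x_t \mid x_{<t}) = \sum_{x_{>t}} p(x_t, x_{>t} \mid x_{<t})
```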

(That said, RLHF seems to break this product-of-the-probabilities property pretty badly, so maybe it will be the case that diffusion will suffer less intelligence loss ::shrugs::).

3 comments

ttctciyf 455 days ago

> in the second case we both know the only likely completion.

You two may, but I don't. 'Decades'? 'Months'? 'Wives'? 'Jobs'? 'Conservative PMs'?

orbital-decay 455 days ago

Diffusion models are built around this type of internal lookahead from the start (accurate near prediction, progressively less accurate far prediction, step forward, repeat). They just do it in the coarse-to-fine direction, i.e. in a different dimension, and had more thought put into shortcuts and speed-accuracy tradeoffs in this process. RL is also used with both types of models. It's not immediately obvious that one must necessarily be more efficient.

byearthithatius 455 days ago

Both are conditional distributions over the context in which they were requested, so, like you said in the second paragraph, the difference is not significant. I see what you mean though, and maybe there are use cases where diffusion is preferable. To me it seems the context conditioning and the internal model are sufficient, so this problem doesn't really occur.

nullc 455 days ago

::nods:: in the case of diffusion though "conditional on its own (eventual) output" is more transparent and explicit.

One place where that might make a difference is when some external syntax restriction in the sampler is going to enforce that the next character after a space is "{".

Your normal AR LLM doesn't know about this restriction and may pick the tokens leading up to the "{" in a way which is regrettable given that there is going to be a "{". The diffusion model, OTOH, can avoid that error.

In the case where there isn't an artificial constraint on the sampler this doesn't come up, because when it's outputting the earlier tokens the AR model knows in some sense about its own probability of outputting a "{" later on.

But in practice pretty much everyone engages in some amount of sampler twiddling, even if just cutting off low probability tokens.
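
A toy sketch of the forced "{" case, with everything in it assumed for illustration: `JOINT` is an invented two-token distribution, and the helper names are hypothetical. It compares what an unaware AR model samples for the first token against what the first token should look like once you know "{" is coming.

```python
# Hypothetical joint distribution over two-token outputs (first, second).
# The numbers are invented purely to illustrate the argument.
JOINT = {
    ("return", "x"): 0.50,
    ("return", "{"): 0.05,
    ("struct", "{"): 0.30,
    ("struct", "x"): 0.15,
}

def marginal_first(joint):
    """What an AR model samples for the first token when it knows nothing of the constraint."""
    out = {}
    for (first, _second), p in joint.items():
        out[first] = out.get(first, 0.0) + p
    return out

def first_given_second(joint, second):
    """Distribution of the first token conditioned on the second token being forced."""
    kept = {first: p for (first, s), p in joint.items() if s == second}
    z = sum(kept.values())
    return {first: p / z for first, p in kept.items()}

unaware = marginal_first(JOINT)
aware = first_given_second(JOINT, "{")

print("first token, constraint unknown:", unaware)
print("first token, '{' known to follow:", aware)
```

In this toy setup the unaware model prefers "return", and the sampler then forces "return {", a pair the model itself considered unlikely. Conditioning on the "{" up front shifts most of the mass to "struct".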

As far as the internal model being sufficient, clearly it is or AR LLMs could hardly produce coherent English. But although it's sufficient it may not be particularly training or weight efficient.

I don't really know how these diffusion text models are trained, so I can't really speculate, but it does seem to me that getting to make multiple passes might let it get by with less circuit depth. The way I think of it, every AR step must expend effort predicting something about the next few steps in order to output something sensible here, and this has to be done over and over again, even though it doesn't change.

nullc 455 days ago

Totally separate from this line of discussion is that if you want to use an LLM for, say, copyediting it's pretty obvious to me how a diffusion model could get much better results.

Like if you take your existing document and measure the probability of your actual word vs an AR model's output, various words are going to show up as erroneously improbable even when the following text makes them obvious. A diffusion model should just be able to score up the entire text conditioned on the entire text rather than just the text in front of it.
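
To make the copyediting point concrete, here is a minimal sketch with a made-up corpus standing in for the model (the sentences, counts, and the `score` helper are all hypothetical). It scores each word of a document once using only the text before it, and once using the text on both sides.

```python
# Hypothetical toy corpus of (sentence, count); the counts are invented.
CORPUS = [
    ("the bank of the river flooded".split(), 2),
    ("the bank of the city closed".split(), 6),
    ("the edge of the city closed".split(), 1),
]

def score(tokens, i, use_right_context):
    """Probability of tokens[i] given the left context, optionally also the right context."""
    match = hit = 0
    for sent, n in CORPUS:
        if len(sent) != len(tokens):
            continue
        left_ok = sent[:i] == tokens[:i]
        right_ok = (not use_right_context) or sent[i + 1:] == tokens[i + 1:]
        if left_ok and right_ok:
            match += n
            if sent[i] == tokens[i]:
                hit += n
    return hit / match

doc = "the bank of the river flooded".split()
for i, word in enumerate(doc):
    print(word, score(doc, i, False), score(doc, i, True))
```

Here "river" looks improbable when scored left-to-right, because "city" is the more common continuation, but it is the only candidate once "flooded" is taken into account. A left-only scorer would flag it; a whole-text scorer would not.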
