by nullc
408 days ago
Consider the entropy of the distribution of token X in these examples: "Four X" and "Four X and seven years ago". In the first case X could be pretty much anything, but in the second case we both know the only likely completion. So it seems like there would be a huge advantage in not having to run autoregressively. But in practice it's less significant than you might imagine, because the AR model can internally model the probability of X conditioned on the stuff it hasn't output yet. In fact, because training without reinforcement causes it to converge on the target probability of the whole output, the AR model must do some form of lookahead internally. (That said, RLHF seems to break this product-of-the-probabilities property pretty badly, so maybe it will be the case that diffusion will suffer less intelligence loss ::shrugs::).
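To make the entropy point concrete, here is a rough toy sketch, not anything from a real model. The three-sentence "language" in `JOINT`, its probabilities, and all the function names are made up for illustration. It compares the entropy of X given only the prefix with the entropy given both sides, and checks that multiplying exact left-to-right conditionals recovers the probability of the whole output.

```python
import math
from collections import defaultdict

# Hypothetical toy "language": complete sequences with made-up probabilities.
JOINT = {
    ("Four", "score", "and", "seven", "years", "ago"): 0.4,
    ("Four", "days", "later", "it", "rained", "again"): 0.3,
    ("Four", "cats", "sat", "on", "the", "mat"): 0.3,
}

def next_token_dist(prefix):
    """Exact AR conditional P(next | prefix), summing over every continuation."""
    prefix = tuple(prefix)
    mass = defaultdict(float)
    for seq, p in JOINT.items():
        if len(seq) > len(prefix) and seq[:len(prefix)] == prefix:
            mass[seq[len(prefix)]] += p
    total = sum(mass.values())
    return {tok: m / total for tok, m in mass.items()}

def infill_dist(prefix, suffix):
    """P(X | prefix, suffix): the slot is conditioned on both sides."""
    prefix, suffix = tuple(prefix), tuple(suffix)
    n = len(prefix)
    mass = defaultdict(float)
    for seq, p in JOINT.items():
        if seq[:n] == prefix and seq[n + 1:] == suffix:
            mass[seq[n]] += p
    total = sum(mass.values())
    return {tok: m / total for tok, m in mass.items()}

def entropy_bits(dist):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def chain_rule_prob(seq):
    """Product of the left-to-right conditionals for one full sequence."""
    prob = 1.0
    for i in range(len(seq)):
        prob *= next_token_dist(seq[:i])[seq[i]]
    return prob

if __name__ == "__main__":
    open_slot = next_token_dist(["Four"])
    pinned_slot = infill_dist(["Four"], ["and", "seven", "years", "ago"])
    print("entropy of X given prefix only:", entropy_bits(open_slot))
    print("entropy of X given both sides: ", entropy_bits(pinned_slot))
    for seq, p in JOINT.items():
        print(" ".join(seq), "| joint:", p, "| chain rule:", chain_rule_prob(seq))
```

In this toy the left-to-right conditionals are computed exactly from the joint, so the chain-rule product matches the joint by construction; the lookahead is hidden in the sum over continuations inside `next_token_dist`. Whether a trained network approximates that sum well is a separate question that this sketch does not address.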
You two may, but I don't. 'Decades'? 'Months'? 'Wives'? 'Jobs'? 'Conservative PMs'?