by BarakWidawsky
315 days ago
I wonder how much of this is due to diffusion models having less capacity for memorization than autoregressive models.

The autoregressive models consistently show better loss for the same number of training tokens.

I find a lot of the conclusions compelling, but I would've loved to see more epochs of training on the 1B model with a 10B dataset, as that model was showing epoch-over-epoch improvements.
|
Diffusion requires more compute than autoregressive models, and the excess is proportional to the sequence length. Time-dilated RNNs and adaptive computation in image recognition hint that we can compute more with the same weights and achieve better results.
This, I believe, also hints at at least one flaw in the TS study: I did not see that they matched DLM and AR by compute; they matched them only by weights.
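
To make the "proportional to sequence length" point concrete, here is a rough back-of-the-envelope sketch for generation cost. It is my own simplification, not something from the study: it assumes about 2 FLOPs per parameter per token per forward pass, ignores the attention term, assumes the AR model uses a KV cache, and assumes the diffusion model reprocesses every position at every denoising step. The function names and the default parameter count are hypothetical.

```python
def ar_generation_flops(n_params: int, seq_len: int) -> int:
    """AR decoding with a KV cache (assumed): one new token per forward pass."""
    return 2 * n_params * seq_len


def diffusion_generation_flops(n_params: int, seq_len: int, n_steps: int) -> int:
    """Diffusion decoding without caching (assumed): each denoising step
    runs the model over all seq_len positions again."""
    return 2 * n_params * seq_len * n_steps


def compute_excess(seq_len: int, n_steps: int, n_params: int = 1_000_000_000) -> float:
    """Ratio of diffusion to AR generation FLOPs under the assumptions above."""
    return (diffusion_generation_flops(n_params, seq_len, n_steps)
            / ar_generation_flops(n_params, seq_len))


if __name__ == "__main__":
    # If the number of denoising steps scales with the sequence length
    # (for example, one token unmasked per step), the excess grows with it.
    for seq_len in (256, 1024, 4096):
        print(seq_len, compute_excess(seq_len, n_steps=seq_len))
```

Under these assumptions the ratio collapses to the number of denoising steps, so it only tracks sequence length when the step count does. With a fixed, small number of steps the gap shrinks accordingly, which is why matching by weights alone says little about matching by compute.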