by cma
315 days ago
> The auto regressive models consistently show better loss for the same number of training tokens

I thought bi-directional (non-autoregressive) transformers show lower loss than autoregressive ones for the same number of training tokens.