|
|
|
|
|
by HarHarVeryFunny
497 days ago
|
|
Latent / embedding-space reasoning seems a step in the right direction, but building recurrence into the model while still relying on gradient descent (i.e. BPTT) to train it seems to create more of a problem (training inefficiency) than it solves, especially since they still end up externally specifying the number of recurrent iterations (r=4, 8, etc) for a given inference. Ideally having recurrence internal to the model would allow the model itself to decide how long to iterate for before outputting anything. |
|
I imagine that they choose a fixed number of recurrent iterations during training for parallelization purposes. Not depending on the previous step to train the next is the main revolution about transformers vs LSTM (plus the higher internal bandwidth). But I agree that it might not be the most efficient model to train due to all that redundant work at large r.