by radarsat1
260 days ago
Thanks, something like that was going through my mind; nice to get a good reference for it. Any insights on why this is not a more popular approach? Maybe it's too difficult for a single layer to scale. I recently read a paper on something similar for diffusion, called Fixed Point Diffusion Models. They specialize the first and last layers, but apply the middle layer repeatedly until it converges. Since a Transformer is a residual model, each layer is presumably adding increasingly precise adjustments to the selected token. It makes a lot of sense to think of the layers as the steps of an optimisation method.
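To make the "recurse one layer until convergence" idea concrete, here is a minimal sketch in plain Python. It is not the method from the Fixed Point Diffusion Models paper, just a toy illustration: `residual_layer`, `fixed_point`, the damping factor `alpha`, and the scalar weights `w` and `b` are all hypothetical names and values I made up. The one shared layer adds a small correction to its input, and the loop stops once the correction falls below a tolerance.

```python
import math


def residual_layer(x, w=0.5, b=1.0, alpha=0.5):
    """Toy weight-tied layer (assumed form): x + alpha * (tanh(w*x + b) - x).

    The second term is the residual "adjustment" added to the input.
    With these values the update is a contraction, so iterating converges.
    """
    return [xi + alpha * (math.tanh(w * xi + b) - xi) for xi in x]


def fixed_point(layer, x, tol=1e-6, max_iters=100):
    """Apply the same layer repeatedly until the update is smaller than tol."""
    for step in range(1, max_iters + 1):
        x_next = layer(x)
        # Size of the latest adjustment (max absolute change per element).
        change = max(abs(a - b) for a, b in zip(x_next, x))
        x = x_next
        if change < tol:
            return x, step
    return x, max_iters


state, steps = fixed_point(residual_layer, [0.0, 2.0, -1.0])
print(steps, state)
```

Each pass makes a smaller correction than the one before, which is the sense in which the iterations look like steps of an optimisation method. A real model would have learned weights and no guarantee of contraction, so convergence there is an empirical question.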