by oofbey
158 days ago
The big bet with this technique is having a fixed (non-learned) matrix that converts the token latent space to the linear attention space. So you can kind of cheat and say your model is small, because a bunch of the smarts live in this big fixed graph Laplacian matrix L. So how do you scale this up from a toy problem? Well, L would have to get bigger, and it's hard to imagine it being useful if L is not trained. Once you train it, this starts to look a lot more like a conventional transformer, but probably harder to train, with the benefit of smaller KV caches. (Half the size, which is not a massive win.) So overall it doesn't seem to me like it's going to amount to anything.
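
To put rough numbers on the "half the size" point, here's a back-of-envelope sketch. The model dimensions are made up for illustration, and I'm assuming (not taking from the paper) that the saving comes from caching one tensor per token per layer instead of separate K and V tensors.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   tensors_per_token=2, bytes_per_elem=2):
    """Rough cache size for one sequence.

    tensors_per_token=2 is the usual K + V pair; 1 is the assumed
    "half size" variant. bytes_per_elem=2 corresponds to fp16/bf16.
    """
    return (n_layers * n_kv_heads * head_dim * seq_len
            * tensors_per_token * bytes_per_elem)


# Hypothetical 32-layer model, 32 heads of dim 128, 4096-token context.
full = kv_cache_bytes(32, 32, 128, 4096)
half = kv_cache_bytes(32, 32, 128, 4096, tensors_per_token=1)

print(f"K+V cache:    {full / 2**30:.1f} GiB")
print(f"halved cache: {half / 2**30:.1f} GiB")
```

A constant 2x saving doesn't change how the cache grows with context length, which is why it doesn't strike me as a big win.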