by deepGem
287 days ago
I spent a few weeks trying to build an alternative to self-attention whose memory scales linearly, and I got surprisingly good results. While in principle this makes a lot of sense, I am struggling to push the test accuracy above 86%. Some of the alternatives I am about to consider:

1. Diffusion with sparse attention layers.
2. Hierarchical diffusion: next-token diffusion combined with higher-order chunk diffusion.

I am still figuring out the code, and I would love any feedback on these approaches.
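
To make option 1 concrete, here is a minimal sketch of one common form of sparse attention: a causal sliding window, where each position attends only to itself and the few positions before it. It is an illustration only, not the code described above. The function name `sliding_window_attention` and the `window` parameter are hypothetical, and plain Python lists stand in for tensors.

```python
import math

def sliding_window_attention(q, k, v, window):
    """Causal local attention over lists of vectors (illustrative sketch).

    Position i attends to positions max(0, i - window + 1) .. i, so at most
    `window` scores exist per position instead of one per earlier token.
    """
    n = len(q)
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i in range(n):
        lo = max(0, i - window + 1)
        # Scaled dot-product scores against the keys inside the window only.
        scores = [
            scale * sum(a * b for a, b in zip(q[i], k[j]))
            for j in range(lo, i + 1)
        ]
        # Numerically stable softmax over the window.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        # Weighted average of the values inside the window.
        row = [
            sum(w * v[lo + t][c] for t, w in enumerate(weights)) / total
            for c in range(len(v[0]))
        ]
        out.append(row)
    return out

q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = sliding_window_attention(q, k, v, window=2)
```

With a fixed `window`, the number of attention scores grows as n * window rather than n^2, which is where the linear memory scaling comes from. The trade-off is that information from beyond the window only reaches a position indirectly, through stacked layers.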