by AaronFriel
760 days ago
The attention mechanism is vastly more efficient to train when it can attend to larger, more meaningful tokens. For inference servers, a significant amount of memory goes into the KV cache, and as you note, building up the embedding through attention would then require correlating far more tokens, each of which is "less meaningful". I think we may get to this point eventually. In the limit we will want multimodal LLMs that understand images and sounds down to the pixel and frequency, and it seems we will eventually want the same for text.

Just make sure to have some big MLPs at the start too, to enrich the "tokens" with the information currently stored in the embedding tables.
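
To put rough numbers on the KV cache point, here is a back-of-the-envelope sketch. The model shape (32 layers, 32 KV heads, head dimension 128, fp16 cache) and the ratio of about 4 bytes of text per subword token are assumptions I picked for illustration, not measurements of any particular model or tokenizer.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size for one sequence.

    Each token stores one key and one value vector (hence the factor 2)
    per layer and per KV head.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def score_matrix_entries(seq_len):
    """Entries in a full (unmasked) attention score matrix, per head per layer."""
    return seq_len * seq_len


# Hypothetical model shape, chosen only for illustration.
N_LAYERS = 32
N_KV_HEADS = 32
HEAD_DIM = 128

# Assumption: one subword token covers about 4 bytes of text on average.
BYTES_PER_SUBWORD_TOKEN = 4

subword_tokens = 2048
byte_tokens = subword_tokens * BYTES_PER_SUBWORD_TOKEN  # same text, byte-level

subword_cache = kv_cache_bytes(subword_tokens, N_LAYERS, N_KV_HEADS, HEAD_DIM)
byte_cache = kv_cache_bytes(byte_tokens, N_LAYERS, N_KV_HEADS, HEAD_DIM)

GIB = 2 ** 30
print(f"subword tokens: {subword_tokens}, KV cache: {subword_cache / GIB:.1f} GiB")
print(f"byte tokens:    {byte_tokens}, KV cache: {byte_cache / GIB:.1f} GiB")
print(f"score matrix growth: "
      f"{score_matrix_entries(byte_tokens) // score_matrix_entries(subword_tokens)}x")
```

Under those assumptions the cache grows linearly with the token count (4x here), while the number of pairwise attention scores grows quadratically (16x), which is the "correlating far more tokens" cost.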