	by AaronFriel 806 days ago
	The attention mechanism is vastly more efficient to train when it can attend to larger, more meaningful tokens. For inference servers, a significant amount of memory goes into the KV cache, and as you note, building up the embedding through attention would then require correlating far more tokens, each of which is "less meaningful". I think we may get to this point eventually. In the limit, we will want multimodal LLMs that understand images and sounds down to the pixel and frequency, and it seems we will eventually want the same for text.
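
To put a rough number on the KV cache point, here is a back-of-the-envelope sketch. The model shape (32 layers, 32 KV heads, head dimension 128, fp16) and the figure of 4 characters per token are assumptions picked for illustration, not anything measured in this thread. The formula is the usual one: keys plus values, per layer, per head, per position.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache held for one sequence of length seq_len."""
    # Factor of 2: one tensor for keys and one for values, in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len

# Hypothetical 7B-class shape in fp16 (assumed, not from the thread).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 32, 128
CHARS_PER_TOKEN = 4  # assumed average for English text

token_len = 2048                         # context measured in subword tokens
char_len = token_len * CHARS_PER_TOKEN   # same text, one position per character

token_cache = kv_cache_bytes(token_len, N_LAYERS, N_KV_HEADS, HEAD_DIM)
char_cache = kv_cache_bytes(char_len, N_LAYERS, N_KV_HEADS, HEAD_DIM)

print(f"token-level: {token_cache / 2**30:.1f} GiB")  # 1.0 GiB
print(f"char-level:  {char_cache / 2**30:.1f} GiB")   # 4.0 GiB
```

The cache grows linearly with sequence length, so under these assumptions the same text costs about 4x the memory per request at character level, before counting the extra attention compute.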


thomasahle 806 days ago

Maybe you could just use a good old 1D-CNN for the bottom 3-4 layers. By then the model will have been able to combine characters into roughly token-length chunks anyway.

Just make sure to have some big MLPs at the start too, to enrich the "tokens" with the information currently stored in the embedding tables.
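
A toy sketch of the downsampling idea, in plain Python so it runs without any framework. Everything here is an assumption for illustration: the random weights, the channel width of 8, kernel size 2, and two stride-2 layers (the comment suggests 3-4 layers; two are enough to show the 4x length reduction). The big MLPs mentioned above are not included.

```python
import random

def conv1d(x, weight, bias, stride):
    """Strided 1D convolution with ReLU over a sequence of vectors.

    x:      list of T vectors, each of length c_in
    weight: nested lists shaped [c_out][kernel][c_in]
    bias:   list of c_out floats
    Returns floor((T - kernel) / stride) + 1 vectors of length c_out.
    """
    kernel = len(weight[0])
    out = []
    for start in range(0, len(x) - kernel + 1, stride):
        window = x[start:start + kernel]
        vec = []
        for w_o, b_o in zip(weight, bias):
            acc = b_o
            for w_k, x_k in zip(w_o, window):
                acc += sum(w * v for w, v in zip(w_k, x_k))
            vec.append(max(acc, 0.0))  # ReLU
        out.append(vec)
    return out

def random_layer(c_in, c_out, kernel, rng):
    """Untrained placeholder weights; a real model would learn these."""
    weight = [[[rng.uniform(-0.1, 0.1) for _ in range(c_in)]
               for _ in range(kernel)] for _ in range(c_out)]
    return weight, [0.0] * c_out

rng = random.Random(0)
C = 8  # channel width (assumed)
text = "attention is all you need, maybe"
data = text.encode()

# Small random per-byte embedding, standing in for a learned byte embedding.
embed = {b: [rng.uniform(-1, 1) for _ in range(C)] for b in set(data)}
x = [embed[b] for b in data]

# Two stride-2 layers shrink the sequence 4x, near ~4 characters per token.
h = x
for _ in range(2):
    w, b = random_layer(C, C, kernel=2, rng=rng)
    h = conv1d(h, w, b, stride=2)

print(len(x), "->", len(h))  # 32 -> 8
```

The transformer layers above this stack would then attend over 8 positions instead of 32, which is where the savings in attention cost and KV cache would come from.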


yk 806 days ago

> a significant amount of memory goes into the KV cache

Is there a good paper (or talk) on how inference looks at scale? (Kinda like ELI-using-single-gpus)


AaronFriel 805 days ago

The PagedAttention paper is a good starting point, as it describes the first major open-source inference engine that had "pretty good" batch performance for transformers.

https://arxiv.org/pdf/2309.06180
