by in-silico
35 days ago
While there is a limit to the amount of information you can fit in a fixed-size state, the theoretical ceiling is pretty high. A Hebbian associative matrix (one of the simplest and weakest memory constructions) can store about 0.7 bits of information per parameter. If you have a state with 300M parameters (the size of a Llama 3 8B KV cache at 10K context length), and a context with 2.1 bits of entropy per token (a reasonable estimate), then the state can hold 300M × 0.7 = 210M bits, which is 100M tokens' worth of information. Real models obviously aren't powerful enough to operate at the limit, but you can see why this is a promising research direction.
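For anyone who wants to check the arithmetic, here is a minimal sketch of the back-of-the-envelope estimate. The three numbers are the ones quoted in the comment; the function and variable names are made up for illustration, and this is only the ceiling calculation, not a claim about what any real model achieves.

```python
def tokens_storable(state_params: float, bits_per_param: float, bits_per_token: float) -> float:
    """Upper bound on tokens a fixed-size state could encode.

    capacity (bits) = parameters * bits stored per parameter
    tokens          = capacity / entropy per token
    """
    capacity_bits = state_params * bits_per_param
    return capacity_bits / bits_per_token


# Figures taken from the comment above (estimates, not measurements).
STATE_PARAMS = 300_000_000   # state size quoted for the KV cache comparison
BITS_PER_PARAM = 0.7         # quoted capacity of a Hebbian associative matrix
BITS_PER_TOKEN = 2.1         # quoted entropy estimate per token

estimate = tokens_storable(STATE_PARAMS, BITS_PER_PARAM, BITS_PER_TOKEN)
print(f"{estimate:,.0f} tokens")
```

The result scales linearly with state size and inversely with per-token entropy, so doubling the entropy estimate halves the token count.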
Can you elaborate on this? I've seen estimates of ~1.5 bits per English letter, and tokens encode a lot more than that: sometimes full words, and with multimodal input even more. If KV cache embeddings are storing more than just simple tokens, but entire concepts with context and nuance, that'll bump the entropy up quite quickly.
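To make the concern concrete, here is a rough sensitivity sketch. It reuses the 300M-parameter and 0.7 bits/parameter figures from the parent comment and the ~1.5 bits/letter figure from this reply. The characters-per-token values are purely hypothetical, and multiplying letters by bits per letter is a crude assumption, since it ignores any extra context or nuance a token might carry.

```python
# Figures from the thread: 300M-parameter state at ~0.7 bits per parameter.
STATE_BITS = 300_000_000 * 0.7

# Per-letter entropy estimate mentioned in the reply.
BITS_PER_LETTER = 1.5


def tokens_at(chars_per_token: float) -> float:
    """Token capacity if each token carries chars_per_token letters of entropy.

    chars_per_token is a hypothetical knob, not a measured tokenizer statistic.
    """
    bits_per_token = BITS_PER_LETTER * chars_per_token
    return STATE_BITS / bits_per_token


# Sweep a few hypothetical token lengths to see how fast the ceiling drops.
for chars in (1, 2, 4, 6):
    print(f"{chars} chars/token -> {tokens_at(chars):,.0f} tokens")
```

Under these assumptions the ceiling falls in proportion to the entropy per token, so the 100M figure depends heavily on which entropy estimate you accept.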