by jbellis
397 days ago
[abstract] This approach significantly reduces the KV cache size relative to traditional multi-head attention

[3.3] For saving the KV cache, only the intermediate latent representations need to be stored: [latex] where r is much smaller than nh · dh [n-sub-h, d-sub-h]

[background] In traditional multi-head attention you must cache full key and value matrices of size T x (nh · dh) where T is the token length, nh is the number of attention heads, dh is the dimensionality of each individual head

Sounds like a big win for memory-constrained environments like local inference.
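
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It counts cached elements per layer for one sequence, using the T x (nh · dh) size from the background quote. The latent side is my assumption, not something the quoted text spells out: I assume one T x r latent each for keys and values (`n_latents=2`); if only a single shared latent is cached, pass `n_latents=1`. The function names and the example sizes (T=4096, nh=32, dh=128, r=512) are hypothetical, picked only for illustration.

```python
def mha_cache_elems(T: int, n_h: int, d_h: int) -> int:
    """Elements cached per layer by standard multi-head attention.

    Keys and values are each a T x (n_h * d_h) matrix.
    """
    return 2 * T * n_h * d_h


def latent_cache_elems(T: int, r: int, n_latents: int = 2) -> int:
    """Elements cached per layer when only rank-r latents are stored.

    ASSUMPTION: n_latents latent matrices of shape T x r are kept
    (2 = one for keys, one for values; 1 = a single shared latent).
    """
    return n_latents * T * r


# Hypothetical sizes, chosen only to illustrate the ratio.
T, n_h, d_h, r = 4096, 32, 128, 512

full = mha_cache_elems(T, n_h, d_h)
latent = latent_cache_elems(T, r)
ratio = full / latent

print(f"MHA cache:    {full:,} elements per layer")
print(f"Latent cache: {latent:,} elements per layer")
print(f"Reduction:    {ratio:.1f}x")
```

With two latents the ratio is just (nh · dh) / r, so the saving depends entirely on how small r can be made without hurting quality.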