
by atgctg 668 days ago

You have to store the KV cache, not the tokens. For Gemma 27B (probably slightly larger than Flash), this would be:

  Size of KV cache = 2 * (num_layers) * (num_kv_heads * dim_head) * seq_length * precision

  8-bit Gemma 27B KV cache = 2 * (46) * (16 * 144) * 1e6 * 1 byte ≈ 200 GB

Note that this doesn't account for any further optimizations Google might be using.

Formula: https://developer.nvidia.com/blog/mastering-llm-techniques-i...

Gemma 27B config: https://huggingface.co/google/gemma-2-27b/blob/main/config.j...
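
The arithmetic above can be reproduced with a short Python sketch. The function name `kv_cache_bytes` and its parameter names are made up for illustration, and the numbers are the ones quoted in the comment (46 layers, 16 KV heads, head dimension 144, 1e6 tokens, 1 byte per element); they have not been re-checked against the linked config here.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem):
    """Estimate KV cache size in bytes for a single sequence.

    Hypothetical helper: it only evaluates the formula from the comment
    and ignores any optimizations a real serving stack might apply.
    """
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Values as quoted in the comment (assumed, not verified against the config).
size = kv_cache_bytes(
    num_layers=46,
    num_kv_heads=16,
    head_dim=144,
    seq_len=1_000_000,
    bytes_per_elem=1,  # 8-bit precision
)
print(f"{size / 1e9:.0f} GB")  # prints "212 GB", i.e. roughly 200 GB
```

Swapping in a different `head_dim` or `bytes_per_elem` (for example 2 for 16-bit) scales the estimate linearly.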


manojlds 668 days ago

Is there some easy to understand source / paper about how this caching works?


xihajun 666 days ago

https://arxiv.org/pdf/2311.04934


danielmarkbruce 667 days ago

Ask ChatGPT to explain how KV caching works. What they are doing is essentially the same thing, with a few more engineering details.
