
by vlovich123 37 days ago

It has nothing to do with local RAM usage. But a million tokens of LLM context is decidedly not 5 MB.

The rough estimate for KV cache size per token is 2 * L * H_kv * D * bytes per element

Where:

* L = number of layers
* H_kv = # of KV heads
* D = head dimension
* factor of 2 = keys + values

The dominant factor here is typically 2 * H_kv * D, since it’s usually at least 2048 bytes per token, per layer.

For Llama 3 8B you’re looking at 128 GiB if your context is really 1M tokens (not that that particular model supports a context that big). DeepSeek4 uses something called sparse attention, so the above math improves: 1M of context would use 5-10 GiB.
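To make the arithmetic behind the 128 GiB figure concrete, here's a rough sketch of the estimate in Python. The config values are assumptions, not something stated above: 32 layers, 8 KV heads, head dimension 128 (a Llama-3-8B-style setup), fp16 at 2 bytes per element, and "1M" taken as 2^20 tokens. The function names are made up for illustration.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element=2):
    """KV cache bytes for one token: 2 * L * H_kv * D * bytes per element."""
    # The leading 2 accounts for storing both keys and values.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def kv_cache_gib(context_tokens, **config):
    """Total KV cache size in GiB for a given number of context tokens."""
    return context_tokens * kv_cache_bytes_per_token(**config) / 2**30


# Assumed Llama-3-8B-style config (not taken from the comment above), fp16 cache.
llama3_8b = dict(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_element=2)

per_token = kv_cache_bytes_per_token(**llama3_8b)  # 131072 bytes = 128 KiB per token
total_gib = kv_cache_gib(2**20, **llama3_8b)       # 128.0 GiB at 2^20 tokens

print(f"{per_token} bytes/token, {total_gib} GiB for 1M tokens")
```

This ignores sparse attention, cache quantization, and anything else that shrinks the cache, so treat it as an upper-end estimate for a plain dense cache.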

But regardless of the details, you’re off by several orders of magnitude.

1 comment

tujux 37 days ago

Pretty sure we're talking about the output text, not the tensors.


m00x 37 days ago

These LLM replies are really getting annoying.


vlovich123 37 days ago

Mine? I literally wrote what I wrote because “context window” as a term of art refers to the LLM’s context window.

I guess get better at detecting LLMs instead of accusing everything of being an LLM reply?
