

by smnscu 32 days ago

Nice post! You piqued my curiosity, so I did a bit of research, and it turns out that with techniques like MTP (multi-token prediction), MLA (multi-head latent attention) and CSA, it's quite probable that these models are much more efficient (and maybe bigger, though 400B sounds about right) than a simple RAM breakdown would suggest.

MTP - https://blog.google/innovation-and-ai/technology/developers-...

MLA - https://machinelearningmastery.com/a-gentle-introduction-to-...

CSA - https://deepseek.ai/blog/deepseek-v4-compressed-attention
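To make the "simple RAM breakdown" point concrete, here is a rough back-of-envelope sketch in Python. It assumes a 400B-parameter model stored at 16 bits per value and compares the KV cache of standard multi-head attention, which stores a full key and value per head per layer, against an MLA-style cache, which stores one shared compressed latent plus a small positional key per layer. All dimensions (`N_LAYERS`, `D_LATENT`, `D_ROPE`, and so on) and the context length are illustrative assumptions, not figures for any specific model, and the sketch ignores MoE routing, quantization and other real-world details.

```python
# Back-of-envelope memory estimate: weights plus KV cache, comparing
# standard multi-head attention (MHA) with a Multi-head Latent Attention
# (MLA)-style compressed cache. All configuration numbers are hypothetical.

def weight_bytes(n_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights."""
    return n_params * bytes_per_param


def mha_kv_bytes_per_token(n_layers: int, n_heads: int, d_head: int,
                           bytes_per_elem: float) -> float:
    """MHA caches a full key and a full value vector per head, per layer."""
    return 2 * n_layers * n_heads * d_head * bytes_per_elem


def mla_kv_bytes_per_token(n_layers: int, d_latent: int, d_rope: int,
                           bytes_per_elem: float) -> float:
    """MLA-style caching stores one compressed latent (d_latent) shared by
    all heads, plus a small decoupled positional key (d_rope), per layer."""
    return n_layers * (d_latent + d_rope) * bytes_per_elem


GIB = 1024 ** 3

# Hypothetical configuration, chosen only for illustration.
N_PARAMS = 400e9
N_LAYERS, N_HEADS, D_HEAD = 60, 128, 128
D_LATENT, D_ROPE = 512, 64
BYTES = 2          # 16-bit values
CONTEXT = 128_000  # tokens held in the cache

weights = weight_bytes(N_PARAMS, BYTES)
mha = mha_kv_bytes_per_token(N_LAYERS, N_HEADS, D_HEAD, BYTES) * CONTEXT
mla = mla_kv_bytes_per_token(N_LAYERS, D_LATENT, D_ROPE, BYTES) * CONTEXT

if __name__ == "__main__":
    print(f"weights:      {weights / GIB:8.1f} GiB")
    print(f"MHA KV cache: {mha / GIB:8.1f} GiB")
    print(f"MLA KV cache: {mla / GIB:8.1f} GiB")
    print(f"cache ratio:  {mha / mla:8.1f}x")
```

Under these assumed dimensions the per-token cache shrinks by 2 × 128 × 128 / 576, roughly 57x, while the weight footprint is unchanged. That is why a breakdown that only divides available RAM by bytes per parameter, or that assumes a plain MHA cache, can misjudge what actually fits in memory.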

1 comment

Doxon 30 days ago

These techniques are used by DeepSeek, and they work well with the commodity (NVIDIA) GPUs it uses. Google designs its entire AI stack from the custom silicon up, so its optimization approaches differ. (Gemma does use MTP, though.)
