Nice post! You piqued my curiosity, so I did a bit of research. It turns out that, with techniques like MTP/MLA/CSA, these models are quite probably much more efficient (and maybe bigger? tho 400B sounds about right) than a simple RAM breakdown would suggest.
These techniques are used by DeepSeek, and work well with the commodity (NVIDIA) GPUs they use.
Google designs its entire AI stack from the custom silicon up, so it takes different optimization approaches.
(Though Gemma does use MTP)
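
To put rough numbers on why a simple RAM breakdown can mislead, here's a back-of-the-envelope sketch comparing per-token KV cache size for plain multi-head attention against an MLA-style compressed latent cache. Every dimension below is a made-up illustrative value, not any real model's config, and the function names are my own.

```python
def kv_bytes_per_token_mha(n_layers, n_heads, head_dim, bytes_per_elem=2):
    # Plain multi-head attention caches one key and one value vector
    # per head, per layer, for every token.
    return 2 * n_layers * n_heads * head_dim * bytes_per_elem


def kv_bytes_per_token_latent(n_layers, latent_dim, rope_dim, bytes_per_elem=2):
    # MLA-style caching (as assumed here) stores one compressed latent
    # plus a small positional component per layer, shared across heads.
    return n_layers * (latent_dim + rope_dim) * bytes_per_elem


# Hypothetical dimensions, chosen only for illustration.
mha = kv_bytes_per_token_mha(n_layers=60, n_heads=128, head_dim=128)
mla = kv_bytes_per_token_latent(n_layers=60, latent_dim=512, rope_dim=64)

print(f"MHA cache per token:    {mha:,} bytes")
print(f"Latent cache per token: {mla:,} bytes")
print(f"Reduction: ~{mha / mla:.1f}x")
```

With these assumed numbers the cache shrinks by well over an order of magnitude per token. That only covers the KV cache (not the weights), but it's the kind of gap a naive memory estimate would miss.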