| HN Mirror

Sorry if that was unclear, I did mean 100Bs as in the next order of magnitude. Even GPT-4 had ~220B active params, though the trend has been towards increased sparsification (lower activation:total ratio). GPT 4.5 is the only publicly facing model that approached 1T active parameters (an experiment to see if there was any value in the extreme inference cost of quadratically increasing compute cost with naïve-like attention). Nowadays you optimize your head size to your attention kernel arch and obtain performance principally through inference time scaling (generate more of tokens) and parallel consensus (gpt pro, gemini deep think etc), both of which favor faster, cheaper active heads.

4o and other H100 era models did indeed drop their activated heads far smaller than gpt-4 to the 10s just like current Hopper-Era Chinese open-source, but it went right back up again post-Blackwell with the 10x L2 bump (for kv cache) in congruence with nlogn attention mechanisms being refined. Similar story for Claude.

The fun speculation is wondering about the true size of Gemini 3's internals, given the petabyte+ world size of their homefield IronwoodV7 systems and Jim Keller's public penchant for envisioning extreme MoE-like diversification across hundreds of dedicated sub-models constructed by individual teams within DeepMind.