| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by BoorishBears 314 days ago

MoE expected performance = sqrt(active heads * total parameter count)

sqrt(120*5) ~= 24

GPT-OSS 120B is effectively a 24B parameter model with the speed of a much smaller model