by BoorishBears 314 days ago
Rule of thumb: MoE expected performance ~= sqrt(active parameter count * total parameter count)

With ~120B total and ~5B active (in billions): sqrt(120*5) ~= 24

So GPT-OSS 120B should perform roughly like a 24B dense model, while running at the speed of a much smaller one, since only the active parameters are computed per token.
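
If you want to plug in other models, here's a quick sketch of the same arithmetic in Python. The function name is made up for illustration, and the geometric mean is only a rough heuristic, not a measured result.

```python
import math

def dense_equivalent_b(total_b: float, active_b: float) -> float:
    """Geometric mean of total and active parameter counts (both in billions).

    Hypothetical helper: a rule-of-thumb estimate only, not a benchmark.
    """
    return math.sqrt(total_b * active_b)

# GPT-OSS 120B: ~120B total, ~5B active
print(round(dense_equivalent_b(120, 5), 1))  # 24.5
```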