by rfoo 321 days ago
Or, you could say, OpenAI has some real technical advancements in areas other than attn architecture. GQA8 and alternating SWA 128 / full attn all seem conventional. Basically they are showing us that there's "no secret sauce in model arch, you guys just suck at mid/post-training", or at least that's what they want us to believe.

The model is pretty sparse tho, 32:1.


The Kimi K2 paper said that model sparsity scales up with parameters pretty well (the MoE sparsity scaling law, as they call it, basically calling Llama 4 MoE "done wrong"). Hence K2 has 128:1 sparsity.
I thought Kimi K2 uses 8 active experts out of 384? Sparsity should be 48:1. Indeed Llama4 Maverick is the only one that has 128:1 sparsity.
You are right. I misremembered the sparsity part of K2. The "done wrong" part I was thinking of was how the scout -> maverick -> behemoth doesn't scale sparsity according to any formula (less sparse -> sparse -> less sparse).
> how the scout -> maverick -> behemoth doesn't scale sparsity according to any formula (less sparse -> sparse -> less sparse)

Ah I see. I didn't notice that behemoth has the same sparsity as scout. That seems quite random indeed.
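
For anyone checking the arithmetic in this subthread, here is a minimal sketch that treats sparsity as total experts divided by active experts per token. Only the 8-of-384 figure for Kimi K2 comes from the comments above. The helper names `sparsity_ratio` and `format_ratio` are made up for illustration, and the 4-of-128 pair is just a hypothetical count that reproduces the 32:1 ratio quoted earlier, not a verified config. Shared (always-on) experts are ignored.

```python
from fractions import Fraction


def sparsity_ratio(total_experts: int, active_experts: int) -> Fraction:
    """Return total/active experts as an exact, reduced fraction."""
    if active_experts <= 0 or total_experts < active_experts:
        raise ValueError("need 0 < active_experts <= total_experts")
    return Fraction(total_experts, active_experts)


def format_ratio(ratio: Fraction) -> str:
    """Render a ratio in the 'N:M' style used in the thread."""
    return f"{ratio.numerator}:{ratio.denominator}"


# From the thread: Kimi K2 routes each token to 8 of 384 experts.
k2 = sparsity_ratio(384, 8)
print("Kimi K2:", format_ratio(k2))

# Hypothetical counts (an assumption) that match the 32:1 quoted above.
example = sparsity_ratio(128, 4)
print("32:1 example:", format_ratio(example))
```

By this definition 8 of 384 works out to 48:1, which matches the correction above rather than the 128:1 originally claimed.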

It's convenient to be able to attribute success to things only OpenAI could've done with the combo of their early start and VC money – licensing content, hiring subject matter experts, etc. Essentially the "soft" stuff that a mature organization can do.