HN Mirror



	by rfoo 321 days ago
	Or, you could say, OpenAI has made real technical advances in areas other than attention architecture. GQA8 and alternating SWA 128 / full attention all seem conventional. Basically they are showing us that there is "no secret sauce in model arch, you guys just suck at mid/post-training", or they want us to believe this. The model is pretty sparse though, 32:1.


liuliu 321 days ago

The Kimi K2 paper said that model sparsity scales up with parameters pretty well (the MoE sparsity scaling law, as they call it, basically calling the Llama 4 MoE "done wrong"). Hence K2 has 128:1 sparsity.


throwdbaaway 321 days ago

I thought Kimi K2 uses 8 active experts out of 384? Sparsity should be 48:1. Indeed Llama4 Maverick is the only one that has 128:1 sparsity.
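The arithmetic behind that figure can be sketched in a few lines. This is only an illustration: it assumes "sparsity" here means total routed experts divided by experts active per token, and it ignores any shared (always-on) experts. The function names `sparsity_ratio` and `format_ratio` are made up for this example.

```python
def sparsity_ratio(total_experts: int, active_experts: int) -> float:
    """Return total experts divided by experts active per token.

    Assumption: shared (always-on) experts are not counted on either side.
    """
    if active_experts <= 0 or total_experts < active_experts:
        raise ValueError("need 0 < active_experts <= total_experts")
    return total_experts / active_experts


def format_ratio(total_experts: int, active_experts: int) -> str:
    """Format the ratio in the 'N:1' style used in this thread."""
    return f"{sparsity_ratio(total_experts, active_experts):g}:1"


# Figures quoted above for Kimi K2: 8 active experts out of 384.
print(format_ratio(384, 8))  # 48:1
```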


liuliu 320 days ago

You are right. I misremembered the sparsity part of K2. The "done wrong" part I was thinking about is how the scout -> maverick -> behemoth doesn't scale sparsity according to any formula (less sparse -> sparse -> less sparse).


throwdbaaway 320 days ago

> how the scout -> maverick -> behemoth doesn't scale sparsity according to any formula (less sparse -> sparse -> less sparse)

Ah I see. I didn't notice that behemoth has the same sparsity as scout. That seems quite random indeed.


nxobject 321 days ago

It's convenient to be able to attribute success to things only OpenAI could've done with the combo of their early start and VC money – licensing content, hiring subject matter experts, etc. Essentially the "soft" stuff that a mature organization can do.
