by rfoo
321 days ago
Or, you could say OpenAI has made real technical advances in areas other than attention architecture. GQA8 and alternating SWA 128 / full attention both look conventional. Basically they are showing us that "there's no secret sauce in the model arch, you guys just suck at mid/post-training", or at least that's what they want us to believe. The model is pretty sparse though, 32:1.
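
For anyone unfamiliar with the terms, here is a rough sketch of what "alternating SWA 128 / full attention" and GQA look like as masks and head mappings. It is not taken from any released code. The function names are made up, it assumes even layers get the sliding window and odd layers get full attention, it reads "GQA8" as 8 key/value heads, and the 64 query heads in the usage lines are only an illustrative number.

```python
def causal_mask(seq_len, window=None):
    """mask[q][k] is True when query position q may attend to key position k.

    window=None gives full causal attention. Otherwise each query sees only
    the most recent `window` positions, itself included (assumed convention).
    """
    return [
        [k <= q and (window is None or q - k < window) for k in range(seq_len)]
        for q in range(seq_len)
    ]


def layer_windows(num_layers, window=128):
    """Per-layer window sizes; None marks a full-attention layer.

    Assumption: even-indexed layers use the sliding window and odd-indexed
    layers use full attention. The real ordering may differ.
    """
    return [window if i % 2 == 0 else None for i in range(num_layers)]


def kv_head_for(query_head, num_query_heads, num_kv_heads=8):
    """GQA: consecutive query heads share a single key/value head."""
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


# Usage with small, illustrative numbers.
print(layer_windows(4))             # window size per layer
print(causal_mask(5, window=2)[4])  # what the last token sees with a window of 2
print(kv_head_for(63, 64))          # K/V head used by the last of 64 query heads
```

The point of the pattern is that sliding-window layers only need to cache the last 128 positions, and with GQA the K/V cache covers 8 heads rather than one per query head.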