by jychang
146 days ago
They didn't do something stupid like Llama 4's "one active expert", but activating 4 of 256 experts is very sparse. It's not going to get close to DeepSeek or GLM level performance unless they trained on the benchmarks. I don't think that was a good move. No other models do this.