by jychang
200 days ago
Mamba-based LLMs aren't even close to novel, though. IBM's been doing this since forever [1]. Also, you're off on DeepSeek V3.2's param count: the full model is 685B with the MTP layer. I don't think there's anything interesting here other than "I guess AMD put out a research paper", and it's not cutting edge when DeepSeek or even IBM is running laps around them.

[1] Here's a news article from April, although IBM has been doing it for a long time before that: https://research.ibm.com/blog/bamba-ssm-transformer-model