by reissbaker 617 days ago
Phi 3.5 is pretty bad in practice: the Phi series always scores well on the popular benchmarks and then falls over IRL (or on less-popular benchmarks). It would be nice to see Zamba2 compared against Qwen2.5, but the Qwen team didn't release any evals on the 7B version AFAIK, so I can see why the Zamba folks compared against the published benchmarks of other similar-sized models.

In general, the idea with these hybrid SSM architectures is to show that you can get good results with fewer training tokens, and to significantly improve inference speed. Even if Qwen2.5 were better at MMLU etc., it definitely used way more training tokens to get there (18T tokens for Qwen2.5 vs 3T for Zamba2, i.e. about 6x as many), so Zamba2 is still a pretty useful result.

TBD if Zamba2 is actually good in real-world usage. Phi 3.5, for example, used only 3.4T tokens and got good public benchmark results; it's just not very good at anything other than the public benchmarks. But Jamba1.5, another hybrid SSM architecture, did seem to do quite well on the LMSys leaderboards (which are admittedly not a super effective measure these days, but still feel less gameable than MMLU), so I'm moderately hopeful that this is a real architectural win and not just gamed benchmarks.