by Hugsun 57 days ago
They're comparing the gap between Qwen's MoE and dense models (smaller) against the gap between Gemma's MoE and dense models (bigger). Your proposed alternative misses the point.
1 comment

Gemma's dense model has more parameters than its MoE has in total, not just more than the MoE's active parameters. You could totally expect the MoE to do terribly by comparison.