by riku_iki 217 days ago
It's MoE; each expert tower can be branched from some smaller model.
1 comment

That's not how MoE works. You need to train the FFN experts directly, together with the gate, or else the FFN gate would have no clue which expert to activate.
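
To make the gating point concrete, here is a rough sketch of a single MoE layer's forward pass, assuming a plain linear gate with softmax and top-k routing. The names (`moe_forward`, `gate_W`, `experts`) are made up for illustration and are not from any library, and the experts are reduced to single linear maps to keep it short.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def moe_forward(x, gate_W, experts, k=1):
    """Route x to the top-k experts picked by a linear gate (hypothetical sketch).

    gate_W:  one row of gate weights per expert.
    experts: list of weight matrices; each stands in for an expert FFN
             (a real expert would also have a nonlinearity).
    """
    probs = softmax(matvec(gate_W, x))
    # Indices of the k experts with the highest gate probability.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    out = [0.0] * len(experts[0])
    for i in top:
        y = matvec(experts[i], x)
        for j, yj in enumerate(y):
            # Weight each selected expert's output by its renormalized gate score.
            out[j] += (probs[i] / norm) * yj
    return out, top, probs

# Toy example: two experts, two-dimensional input.
gate_W = [[1.0, 0.0], [0.0, 1.0]]
experts = [
    [[1.0, 0.0], [0.0, 1.0]],  # expert 0: identity
    [[2.0, 0.0], [0.0, 2.0]],  # expert 1: doubles the input
]
out, top, probs = moe_forward([2.0, 0.0], gate_W, experts, k=1)
```

The output depends on `gate_W` as much as on the expert weights, since the gate decides which expert runs at all. If the experts were copied in from separately trained models, nothing in `gate_W` would reflect what each one is good at until the gate itself is trained against them.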