by
zackangelo
217 days ago
What 1T parameter base model have you seen from any of those labs?
riku_iki
217 days ago
It's MoE; each expert tower can be branched from some smaller model.
jychang
213 days ago
That's not how MoE works. You need to train the FFN directly, or else the FFN gate would have no clue how to activate the expert.
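
To make the point about the gate concrete, here is a minimal sketch of top-k routing in an MoE layer, assuming a plain softmax gate over per-expert logits. Every name in it (`route`, `moe_forward`, `gate_weights`, the toy experts) is hypothetical and not taken from any particular model. The gate's scores come entirely from its own learned weights, so those weights only mean something if they were trained against the outputs of the experts they select.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(token, gate_weights, top_k=2):
    """Return [(expert_index, weight), ...] for the top_k experts."""
    # One logit per expert: dot product of the token with that expert's gate row.
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalize so the selected experts' weights sum to 1.
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(token, gate_weights, experts, top_k=2):
    # Weighted sum of the selected experts' outputs.
    out = [0.0] * len(token)
    for i, w in route(token, gate_weights, top_k):
        y = experts[i](token)
        out = [o + w * v for o, v in zip(out, y)]
    return out

# Hypothetical toy setup: 3 experts, 2-dimensional tokens.
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
experts = [
    lambda t: [2.0 * v for v in t],   # stand-in for expert FFN 0
    lambda t: [v + 1.0 for v in t],   # stand-in for expert FFN 1
    lambda t: [0.0 for _ in t],       # stand-in for expert FFN 2
]

token = [3.0, 1.0]
chosen = route(token, gate_weights, top_k=2)
output = moe_forward(token, gate_weights, experts, top_k=2)
```

In a real model the rows of `gate_weights` are learned jointly with the expert FFNs by backpropagating through `moe_forward`. If the experts were swapped for independently trained networks, nothing would tie those rows to what each expert actually computes.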