Hacker News
by
orbital-decay
335 days ago
Inseparable: routing is done per token in a statistically optimal way, not per request on a knowledge-domain basis.
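To make "per token, not per request" concrete, here is a minimal sketch of top-k routing in a mixture-of-experts layer. Everything in it is an assumption for illustration: the `route_token` helper, the logit values, and the choice of 4 experts with k=2 are hypothetical and not taken from any specific model.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token's router logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, k=2):
    """Pick the top-k experts for ONE token; return [(expert_index, weight), ...]."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the chosen experts
    return [(i, probs[i] / norm) for i in top]

# Hypothetical router logits for three consecutive tokens of a single request.
request_logits = [
    [2.0, 0.1, -1.0, 0.5],
    [-0.5, 0.2, 3.0, 1.0],
    [0.0, 2.5, 0.3, -2.0],
]

# Each token is routed independently, so one request can touch many experts.
assignments = [route_token(logits, k=2) for logits in request_logits]
for t, chosen in enumerate(assignments):
    print(t, [(i, round(w, 3)) for i, w in chosen])
```

With these made-up logits, the three tokens of one request land on three different expert pairs, which is why you cannot load only "the experts for this topic" ahead of time.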
viraptor
335 days ago
Sure, it's done per token, but the question is how well the knowledge domains match up with the experts. I could not find hard data on this.
boroboro4
334 days ago
Check out the DeepSeek v3 model paper. They changed the way they train experts (went from an aux loss to a different kind of expert separation training). It did improve the experts' domain specialization, and they have neat graphics on it in the paper.
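For anyone unfamiliar with the "aux loss" being replaced, here is a rough sketch of an auxiliary load-balancing loss in the style popularized by the Switch Transformer. This is an assumption about which kind of aux loss is meant; it is not the DeepSeek method, and the function name and `alpha` value are hypothetical.

```python
def load_balancing_loss(router_probs, alpha=0.01):
    """Switch-style aux loss: alpha * N * sum_i f_i * P_i.

    router_probs: one probability vector per token (each sums to 1).
    f_i: fraction of tokens whose top-1 expert is i.
    P_i: mean router probability assigned to expert i.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])

    counts = [0] * num_experts
    for p in router_probs:
        counts[max(range(num_experts), key=lambda i: p[i])] += 1
    f = [c / num_tokens for c in counts]

    P = [sum(p[i] for p in router_probs) / num_tokens for i in range(num_experts)]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, P))

# Toy extremes: tokens spread evenly over 4 experts vs. all sent to expert 0.
balanced = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
collapsed = [[1, 0, 0, 0]] * 4

print(load_balancing_loss(balanced) < load_balancing_loss(collapsed))
```

The loss is smallest when tokens are spread evenly, so it pushes every expert toward the same load regardless of content. That pressure toward uniformity is one plausible reason it could work against domain specialization.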