Hacker News
by zozbot234 2 days ago
> Why? Why do you think that's the case? Part of the training is balancing load between experts.

Training balances expert choice only in aggregate, across the entire scope of the model. Experiments have consistently shown that within a given session or topic (taken in a broad sense), expert choice is biased toward a subset of experts, which is likely to make caching useful and reuse across a user-specific batch realistic.
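
To make the caching argument concrete, here is a toy simulation, not a measurement of any real model. Everything in it is an assumption picked for illustration: the expert count, the cache size, the size of the "hot" subset, and how strongly a session favors it. It compares the hit rate of an LRU expert cache under uniformly balanced routing with the hit rate under session-biased routing.

```python
import random
from collections import OrderedDict


def lru_hit_rate(expert_ids, cache_size):
    """Fraction of expert lookups served from an LRU cache of the given size."""
    cache = OrderedDict()
    hits = 0
    for expert in expert_ids:
        if expert in cache:
            hits += 1
            cache.move_to_end(expert)      # mark as most recently used
        else:
            cache[expert] = True           # "load" the expert
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used
    return hits / len(expert_ids)


def simulate(num_experts=64, cache_size=16, tokens=20000,
             hot_experts=12, hot_prob=0.8, seed=0):
    """Return (uniform_hit_rate, biased_hit_rate). All parameters are hypothetical."""
    rng = random.Random(seed)

    # Routing as it looks when averaged over everything: every expert equally likely.
    uniform = [rng.randrange(num_experts) for _ in range(tokens)]

    # Routing inside one session (assumed): a small hot subset gets most tokens,
    # the rest still goes anywhere.
    hot = rng.sample(range(num_experts), hot_experts)
    biased = [
        rng.choice(hot) if rng.random() < hot_prob else rng.randrange(num_experts)
        for _ in range(tokens)
    ]

    return lru_hit_rate(uniform, cache_size), lru_hit_rate(biased, cache_size)


if __name__ == "__main__":
    uniform_rate, biased_rate = simulate()
    print(f"uniform routing hit rate: {uniform_rate:.2f}")
    print(f"session-biased hit rate:  {biased_rate:.2f}")
```

If different sessions favor different hot subsets, usage can still average out to balanced across the whole model, while each individual session stays cache-friendly. How strong the bias is in practice depends on the model and workload, which this sketch does not attempt to capture.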