|
|
|
|
|
by frotaur
83 days ago
|
|
Afaik the experts are not usually very interpretable, and generally would be surprised if at least one does not change every token. I don't know what happens in practice, but I know at least during training, nothing is done to minimize the number of expert switches between tokens. |
|
If one is to use these on hardware that can't keep everything loaded I guess someone should examine how it works out in practice. Interpretability may be be a too much to ask, but I can't spontaneously see any reason why the experts can't at least be pushed to incorporate what's needed to remain the good choice for a longer segment.