|
|
|
|
|
by radq
1089 days ago
|
|
If it's similar to the switch transformer architecture [1], which I suspect it is, then the models are all trained on the same corpus and the routing model learns automatically which experts to route to. It's orthogonal to beam search - the benefit of the architecture is that it allows sparse inference. [1] https://arxiv.org/pdf/2101.03961.pdf |
|