by willvarfar
269 days ago
Reminds me of another Tencent paper (https://dl.acm.org/doi/10.1145/3711896.3736949) on how to combine distillation and ensembling for faster parallel inference. That was Tencent doing parallelism at the model level, and now this is their evolution on MoE. Very complementary.