| HN Mirror



	by willvarfar 269 days ago
	Reminds me of another Tencent paper (https://dl.acm.org/doi/10.1145/3711896.3736949) on how to combine distillation and ensembling for faster parallel inference. That was Tencent doing parallelism at the model level, and now this is their evolution on MoE. Very complementary.