| HN Mirror



	by popinman322 481 days ago
	You can swap experts in and out of VRAM; it just increases inference time substantially. Depending on the routing function, you can figure out all the active experts ahead of the forward pass for a single token and pipeline the expert loading.
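
A rough sketch of the pipelining idea, assuming a routing function that looks only at the token id and layer index (hash-style routing; the comment does not say which router is meant). `hash_route`, `load_expert`, and `forward_token` are made-up names, and the "load" is a stand-in for a host-to-VRAM copy:

```python
from concurrent.futures import ThreadPoolExecutor

NUM_LAYERS = 4
NUM_EXPERTS = 8

def hash_route(token_id: int, layer: int) -> int:
    """Hypothetical router that depends only on token id and layer index."""
    return (token_id * 31 + layer * 7) % NUM_EXPERTS

def load_expert(layer: int, expert: int) -> str:
    """Stand-in for copying one expert's weights from host RAM into VRAM."""
    return f"weights[L{layer}][E{expert}]"

def forward_token(token_id: int) -> list[str]:
    # Routing needs no activations here, so the full plan is known up front.
    plan = [hash_route(token_id, layer) for layer in range(NUM_LAYERS)]
    used = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_expert, 0, plan[0])
        for layer in range(NUM_LAYERS):
            weights = pending.result()  # wait for this layer's expert
            if layer + 1 < NUM_LAYERS:
                # Start loading the next layer's expert while this layer computes.
                pending = pool.submit(load_expert, layer + 1, plan[layer + 1])
            used.append(weights)  # placeholder for the actual expert computation
    return used
```

This only works because the plan can be built without running any layer; with a learned router that reads the hidden state, the plan is not available up front.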

1 comment

The chosen expert (on each layer) depends on the output of the previous layer. Not sure how you can preload the experts before the forward pass.
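
A toy illustration of that dependency, assuming a top-1 router that takes the argmax of a linear score over the current hidden state. The router weights and the two "experts" are invented for the example and are not from any real model:

```python
# Expert 0 scores hidden[0], expert 1 scores hidden[1] (hypothetical weights).
ROUTER_WEIGHTS = [[1.0, 0.0], [0.0, 1.0]]

def route(hidden):
    """Top-1 routing: pick the expert with the highest linear score."""
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in ROUTER_WEIGHTS]
    return max(range(len(scores)), key=scores.__getitem__)

def apply_expert(expert, hidden):
    """Toy experts: expert 0 swaps the coordinates, expert 1 adds 2 to the first."""
    if expert == 0:
        return [hidden[1], hidden[0]]
    return [hidden[0] + 2.0, hidden[1]]

def trace_experts(hidden, num_layers):
    chosen = []
    for _ in range(num_layers):
        expert = route(hidden)  # needs the hidden state produced by the previous layer
        chosen.append(expert)
        hidden = apply_expert(expert, hidden)
    return chosen
```

The expert for layer 1 cannot be computed until layer 0 has produced its output, so at best you can start loading a layer's expert once the layer before it finishes.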