| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by ydj 86 days ago
	Not necessarily. Servers serving the model likely has enough traffic that they are batching decodes already. MTP reduces latency and increase efficiency only when the server can’t batch enough concurrent streams to be compute bound rather than memory bound.

1 comments

Fair didn't think about batching makes more sense for self hosted models then.