| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by dramlord 921 days ago
	You can amortize memory loading with large continuous batching. I imagine more compute would help the problem for certain workloads like speculative decoding

1 comments

Batching helps throughput and anyone running in production will be doing batching.

But it's not free, and still comes at a cost of per-stream latency.

Speculative decoding seems less effective in practice than in theory.