


	by FL33TW00D 483 days ago
	It depends on the batch size and the accelerator you're running on! Decode is typically memory-bound unless you can reach high batch sizes (in the hundreds), which is hard during serving because large batches are in tension with low TTFT (time to first token). https://jax-ml.github.io/scaling-book/inference/ is a good read!
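
	A rough back-of-envelope sketch of where the "in the hundreds" figure comes from, using a simple roofline model. Assumptions (mine, not from the linked book's text): each decode step reads every weight once, does about 2 FLOPs per parameter per sequence in the batch, and KV-cache traffic is ignored. The function names and the accelerator numbers are hypothetical, chosen only for illustration.

```python
def critical_batch_size(peak_flops: float, mem_bandwidth: float,
                        bytes_per_param: float = 2.0) -> float:
    """Batch size at which a decode step stops being memory-bound.

    Simplified roofline model (assumption): per step, compute time is
    2 * params * batch / peak_flops and memory time is
    bytes_per_param * params / mem_bandwidth. Setting them equal and
    solving for batch gives the crossover; params cancels out.
    """
    return peak_flops * bytes_per_param / (2.0 * mem_bandwidth)


def decode_bound(batch_size: int, peak_flops: float, mem_bandwidth: float,
                 bytes_per_param: float = 2.0) -> str:
    """Return 'memory' or 'compute' depending on which side of the crossover we are."""
    crossover = critical_batch_size(peak_flops, mem_bandwidth, bytes_per_param)
    return "compute" if batch_size >= crossover else "memory"


# Hypothetical accelerator: 1e15 FLOP/s peak, 4e12 bytes/s of memory bandwidth,
# 2-byte (bf16) weights. These are illustrative numbers, not a real chip's spec.
PEAK_FLOPS = 1.0e15
MEM_BANDWIDTH = 4.0e12

if __name__ == "__main__":
    print(critical_batch_size(PEAK_FLOPS, MEM_BANDWIDTH))
    for batch in (1, 32, 512):
        print(batch, decode_bound(batch, PEAK_FLOPS, MEM_BANDWIDTH))
```

	With 2-byte weights the crossover is just peak FLOP/s divided by memory bandwidth, so it depends on the chip and not on model size. Long contexts add KV-cache reads that this sketch leaves out, which tends to keep real decode memory-bound for longer.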