| HN Mirror



	by martinald 291 days ago
	Thanks for the correction (author here). I'll update the article - very fair point on the compute for input tokens, which I messed up. Tbh I'm pleased my napkin math was only 7x off the laws of physics :). Even rerunning the math for my use cases with a much higher input token cost doesn't change much, though.


chillee 291 days ago

The choice of 32 parallel sequences is also arbitrary and significantly changes your conclusions. For example, if they run with 256 parallel sequences, your calculations would come out 8x cheaper for both prefill and decode.
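To make the arithmetic concrete, here is a rough sketch of how the batch-size assumption feeds into the cost estimate. It assumes cost per token scales inversely with the number of parallel sequences, which is how the 8x figure above falls out; the function name and the list of batch sizes are made up for illustration and are not from the article.

```python
def relative_cost_per_token(parallel_sequences: int, baseline_sequences: int = 32) -> float:
    """Cost per token relative to the baseline batch size.

    Assumption (illustrative only): the cost of a step is fixed and is
    shared evenly across all sequences in the batch, so cost per token
    scales as 1 / parallel_sequences.
    """
    if parallel_sequences <= 0 or baseline_sequences <= 0:
        raise ValueError("sequence counts must be positive")
    return baseline_sequences / parallel_sequences


# Hypothetical batch sizes, compared against the 32-sequence baseline.
for batch in (32, 64, 128, 256):
    rel = relative_cost_per_token(batch)
    print(f"{batch:>3} sequences: {1 / rel:.0f}x cheaper than the 32-sequence baseline")
```

Under that assumption, going from 32 to 256 sequences divides the per-token cost by 256 / 32 = 8. Real deployments will deviate from this once the batch gets large enough to become compute-bound.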

The part about attention needing long context lengths to become compute-bound is also quite misleading.


Barbing 291 days ago

Anyone up for publishing their own guess range?
