by chillee 297 days ago
The choice of 32 parallel sequences is also arbitrary and significantly changes your conclusions. For example, if they run with 256 parallel sequences, that would make both prefill and decode 8x cheaper in your calculations.
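
To spell out where the 8x comes from, here is a rough sketch. It assumes the cost of a step is fixed and that throughput scales linearly with the number of parallel sequences, which is an idealization that stops holding once the hardware saturates. The function name is made up for illustration.

```python
def cost_reduction_factor(baseline_sequences: int, new_sequences: int) -> float:
    """How many times cheaper each token gets when the batch grows.

    Assumption (not from the original post): step cost is constant, so
    per-token cost is inversely proportional to the number of sequences.
    """
    if baseline_sequences <= 0 or new_sequences <= 0:
        raise ValueError("sequence counts must be positive")
    return new_sequences / baseline_sequences


# 32 parallel sequences is the figure used in the post; 256 is the alternative.
factor = cost_reduction_factor(32, 256)
print(f"{factor:g}x cheaper per token")
```

Under that assumption the same factor applies to prefill and decode alike, since both are divided by the same batch size.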

The part about attention needing long context lengths to be compute-bound is also quite misleading.

1 comments

Anyone up for publishing their own guess range?