by
mich5632
314 days ago
I think this is the difference between compute-bound prefill (a CPU has a high bandwidth-to-compute ratio) and memory-bandwidth-bound decode. The time to first token is below 0.5 s, even for a 10k context.
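
To make the prefill/decode split concrete, here is a rough back-of-the-envelope sketch. It assumes a dense transformer where a forward pass costs about 2 FLOPs per parameter per token (attention cost ignored) and where each decoded token streams every weight from memory once. The model size and both hardware profiles are made-up illustrative numbers, not measurements of any real machine.

```python
def prefill_seconds(n_params, prompt_tokens, flops_per_second):
    """Compute-bound estimate: ~2 FLOPs per parameter per prompt token."""
    return 2.0 * n_params * prompt_tokens / flops_per_second


def decode_seconds_per_token(n_params, bytes_per_param, bytes_per_second):
    """Bandwidth-bound estimate: every weight is read once per generated token."""
    return n_params * bytes_per_param / bytes_per_second


# Hypothetical workload: 8B parameters, 8-bit weights, 10k-token prompt.
N_PARAMS = 8e9
BYTES_PER_PARAM = 1
PROMPT_TOKENS = 10_000

# Hypothetical hardware profiles: (peak FLOP/s, memory bandwidth in bytes/s).
PROFILES = {
    "cpu-like": (2e12, 100e9),
    "accelerator-like": (400e12, 1e12),
}

if __name__ == "__main__":
    for name, (flops, bandwidth) in PROFILES.items():
        ttft = prefill_seconds(N_PARAMS, PROMPT_TOKENS, flops)
        per_token = decode_seconds_per_token(N_PARAMS, BYTES_PER_PARAM, bandwidth)
        print(f"{name}: prefill ~{ttft:.2f} s, decode ~{per_token * 1000:.0f} ms/token")
```

With these assumed numbers the two profiles differ by 200x in compute but only 10x in bandwidth, so the gap in prefill time is far larger than the gap in decode speed.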