by mich5632 314 days ago
I think this is the difference between compute-bound pre-fill (a CPU has a high bandwidth/compute ratio) and decode. The time to first token is below 0.5s, even for a 10k context.
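
A rough back-of-envelope sketch of that distinction, in roofline terms. Everything here is an assumption for illustration: the ~2 FLOPs per parameter per token rule of thumb, weights being read once per forward pass, and the hardware numbers, which are made-up placeholders rather than measurements of any real chip. It also ignores attention FLOPs and KV-cache traffic.

```python
def arithmetic_intensity(tokens_per_pass, bytes_per_param):
    """FLOPs per byte of weights read in one forward pass.

    Assumes ~2 FLOPs per parameter per token and that each weight
    is read from memory once per pass (parameter count cancels out).
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def bound(tokens_per_pass, bytes_per_param, peak_flops, mem_bandwidth):
    """Return 'compute' or 'memory' depending on which limit dominates."""
    # Machine balance: FLOPs the chip can do per byte it can fetch.
    machine_balance = peak_flops / mem_bandwidth
    intensity = arithmetic_intensity(tokens_per_pass, bytes_per_param)
    return "compute" if intensity > machine_balance else "memory"


# Hypothetical hardware: 1 TFLOP/s peak, 100 GB/s memory bandwidth.
PEAK_FLOPS = 1e12
MEM_BANDWIDTH = 100e9
BYTES_PER_PARAM = 2  # assuming 16-bit weights

# Pre-fill processes the whole 10k-token prompt in one pass.
prefill = bound(10_000, BYTES_PER_PARAM, PEAK_FLOPS, MEM_BANDWIDTH)
# Decode processes one new token per pass.
decode = bound(1, BYTES_PER_PARAM, PEAK_FLOPS, MEM_BANDWIDTH)

print("pre-fill is", prefill, "bound")
print("decode is", decode, "bound")
```

Under these assumptions pre-fill reuses each weight across thousands of tokens, so it lands on the compute side, while decode touches every weight for a single token and lands on the bandwidth side.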