by Tostino 1035 days ago
I don't believe this is correct, based on my direct experience running local models. They generate more slowly as the context window fills up than they do when they first start responding.

My understanding (not an expert) is that the time for an LLM to produce an output is linear in the length of the output, but may not be linear in the length of the input (i.e. the context). It may be quadratic in the context, or the model may use some kind of fancy attention optimization to do better.
Yeah... but every new token you generate becomes part of the context. To generate the next one, the model has to take it into account along with all previously generated tokens and the input provided by the user.
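
To make that concrete, here is a rough sketch in Python. It assumes a plain transformer with a KV cache and counts only how many cached positions each new token attends over. It ignores the MLP layers, memory bandwidth, batching and any attention optimizations, and the function names are made up for illustration, so treat it as a toy cost model rather than a benchmark.

```python
def attention_reads_per_token(prompt_len: int, n_generated: int) -> list[int]:
    """Toy cost model: positions attended over for each generated token.

    Producing generated token i (0-based) means attending over the prompt
    plus the i tokens generated before it, so the count grows by one per step.
    """
    return [prompt_len + i for i in range(n_generated)]


def total_attention_reads(prompt_len: int, n_generated: int) -> int:
    """Sum of the per-token counts over the whole generation."""
    return sum(attention_reads_per_token(prompt_len, n_generated))


if __name__ == "__main__":
    # Each step is slightly more expensive than the one before it.
    print(attention_reads_per_token(100, 4))

    # Doubling the output length roughly quadruples the total attention work.
    short = total_attention_reads(0, 1000)
    long = total_attention_reads(0, 2000)
    print(short, long, long / short)
```

Under these assumptions the per-token cost grows linearly with the context, so the total over a long generation grows roughly quadratically in the output length. That matches the observation that generation slows down as the context window fills up.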