by dgreensp 1035 days ago
My understanding (not an expert) is that the time for an LLM to produce an output is linear in the length of the output, but may not be linear in the length of the input (i.e. the context). It may be quadratic in the context, or something better than that if some kind of fancy attention optimization is used.
1 comment

Yeah... But every new token you generate has to be taken into account, along with all previously generated tokens and the input provided by the user, when generating the next one. So the work per token grows as the output gets longer.
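
To make that concrete, here's a rough toy model (my own sketch, not how any real inference stack measures cost). It assumes plain full causal attention with a key/value cache, so each new token looks once at every token before it, and it counts only those lookups. The names `attention_reads` and `total_attention_reads` are made up for illustration.

```python
def attention_reads(prompt_len: int, output_len: int) -> list[int]:
    """Positions attended to at each generation step.

    Assumption: full causal attention with a KV cache, so step i
    (0-indexed) looks at the prompt plus the i tokens already generated.
    """
    return [prompt_len + i for i in range(output_len)]


def total_attention_reads(prompt_len: int, output_len: int) -> int:
    """Total lookups over the whole generation under the same assumption."""
    return sum(attention_reads(prompt_len, output_len))


if __name__ == "__main__":
    # Per-step work grows by one lookup for every token generated so far.
    print(attention_reads(prompt_len=3, output_len=4))
    # Compare totals for a short and a doubled output length.
    print(total_attention_reads(0, 1000), total_attention_reads(0, 2000))
```

Under these assumptions the total is `output_len * prompt_len + output_len * (output_len - 1) / 2`, so with an empty prompt, doubling the output roughly quadruples the attention work. With a long prompt and a short output the first term dominates and it looks closer to linear in the output, which may be why it often feels linear in practice.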