Hacker News
by jiggawatts 1035 days ago
The training cost scales as n^2 where ‘n’ is the maximum token window size. During inference the compute cost is constant per token.

Disclaimer: I’m not an expert, this is just what I’ve picked up reading about the technology.

2 comments

Also not an expert, but I believe it is a little bit of both for inference.

If you are generating token-by-token naively, you do need to pay the n^2 cost at every step, since every token must attend to all other tokens. Generating a sequence of 5 tokens starting from 2 (infer 3rd, infer 4th, infer 5th) will be much faster than starting from 1024 (infer 1025th, infer 1026th, ...) since your n is smaller. But each time it is n^2.
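
To put rough numbers on the 2-token versus 1024-token comparison, here is a back-of-the-envelope sketch. It is not taken from any real model: `naive_generation_cost` is a made-up helper, and it assumes one unit of work is a single query-key score and that each step rescores all n tokens against all n tokens (ignoring causal masking, which would roughly halve the count).

```python
def naive_generation_cost(prompt_len: int, new_tokens: int) -> int:
    """Count attention scores when every step redoes full n x n attention.

    Assumption: one unit of work is one query-key score, and each step
    scores all n tokens against all n tokens.
    """
    total = 0
    for step in range(new_tokens):
        n = prompt_len + step  # tokens already in the sequence at this step
        total += n * n
    return total


short_prompt = naive_generation_cost(prompt_len=2, new_tokens=3)
long_prompt = naive_generation_cost(prompt_len=1024, new_tokens=3)
print(short_prompt, long_prompt)  # 29 3151877
```

Same three generated tokens in both cases, but the long prompt does about five orders of magnitude more attention work under this simplified count.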

However that is a naive approach. There is a common optimization, KV caching[1] (on by default for HuggingFace models), that caches the keys and values from the prior steps so you only have to compute the attention for the new token. So you get something like (infer 1025th; with 1025 tokens cached, attend the 1 new token over those 1025; with 1026 cached, attend the 1 new token over those 1026; ...). Each step is linear in the current context length, so it is not quite constant time per token, but much better than n^2.
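
A toy sketch of the idea, in plain Python rather than any real library: `attend`, `KVCache` and `naive_step` are hypothetical names, there is a single attention head, and the same random vector stands in for each token's query, key and value (a real model would compute those with learned projections). It only shows that the cached path gives the same output for the newest token while computing far fewer query-key scores.

```python
import math
import random


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[j] for w, v in zip(weights, values)) / total
        for j in range(len(values[0]))
    ]


class KVCache:
    """Keeps the key and value vectors of every token seen so far."""

    def __init__(self):
        self.keys = []
        self.values = []
        self.scores_computed = 0

    def step(self, query, key, value):
        # Only the new token's key/value are added; older ones are reused.
        self.keys.append(key)
        self.values.append(value)
        self.scores_computed += len(self.keys)  # one score per cached key
        return attend(query, self.keys, self.values)


def naive_step(queries, keys, values):
    """Recompute causal attention for every position, keep only the last row."""
    rows = [
        attend(queries[i], keys[: i + 1], values[: i + 1])
        for i in range(len(queries))
    ]
    scores_computed = sum(i + 1 for i in range(len(queries)))
    return rows[-1], scores_computed


random.seed(0)
tokens = [[random.gauss(0, 1) for _ in range(4)] for _ in range(8)]

cache = KVCache()
naive_scores = 0
for t in range(len(tokens)):
    cached_out = cache.step(tokens[t], tokens[t], tokens[t])
    seen = tokens[: t + 1]
    naive_out, work = naive_step(seen, seen, seen)
    naive_scores += work
    # Both paths must agree on the output for the newest token.
    assert all(math.isclose(a, b) for a, b in zip(cached_out, naive_out))

print(cache.scores_computed, naive_scores)  # 36 120
```

With the cache, step t costs t scores, so per-token cost still grows with the context; without it, every step repeats all the earlier rows as well.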

I would imagine there are other optimizations too, but this is the one I've heard of.

[1] fairly code-y, but links to some other posts at the start: https://www.dipkumar.dev/becoming-the-unbeatable/posts/gpt-k...

I don't believe this is correct, based on my direct experience running local models. They generate more slowly once the context window starts to fill up than when they first start responding.
My understanding (not an expert) is that the time for an LLM to produce an output is linear in the length of the output, but may not be in the length of the input (i.e. context). It may be quadratic in the context, or it may use some kind of fancy attention optimization.
Yeah... But every new token you generate has to be taken into account, along with all previously generated tokens and the input provided by the user, when generating the next one.