

	by kherud 1035 days ago
	Why is this the expected result? The original transformer algorithm has an n^2 computational complexity, where n is the number of tokens. As far as I know, there are some improvements which bring it down to something like n*log(n). A linear complexity seems surprising, however. Is the reason that calculating the attention can be completely parallelized with decent hardware, so the response time stays linear?


jiggawatts 1035 days ago

The training cost scales as n^2 where ‘n’ is the maximum token window size. During inference the compute cost is constant per token.

Disclaimer: I’m not an expert, this is just what I’ve picked up reading about the technology.


smpanaro 1034 days ago

Also not an expert, but I believe it is a little bit of both for inference.

If you are generating token-by-token naively, you do need to pay the n^2 cost, since every token must attend to all other tokens. Generating a sequence of 5 tokens starting from 2 (infer 3rd, infer 4th, infer 5th...) will be much faster than starting from 1024 (infer 1025th, infer 1026th, ...) since your n is smaller. But each time it is n^2.
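To put rough numbers on that, here is a minimal sketch that counts how many query-key pairs get scored when every step recomputes full attention. The function name `naive_attention_pairs` is made up for illustration, and it assumes one layer, one head, and a flat n * n pairs per step; a real model multiplies this by layers and heads and adds the non-attention work.

```python
def naive_attention_pairs(start_len: int, new_tokens: int) -> int:
    """Count query-key pairs scored when each step recomputes full attention.

    Simplifying assumption: n * n pairs for a sequence of length n
    (one layer, one head, no savings from the causal mask).
    """
    total = 0
    for step in range(new_tokens):
        n = start_len + step  # tokens visible when inferring the next one
        total += n * n
    return total


short_run = naive_attention_pairs(2, 3)     # infer 3rd, 4th, 5th
long_run = naive_attention_pairs(1024, 3)   # infer 1025th, 1026th, 1027th
print(short_run, long_run)  # prints: 29 3151877
```

Same number of generated tokens, but the long-context run scores roughly 100,000 times more pairs under these assumptions.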

However, that is a naive approach. There is a common optimization, KV caching[1] (on by default for HuggingFace models), that caches the keys and values computed in prior steps so you only have to compute the attention for the new token. So you get something like (infer 1025th and cache it, attend the 1 new token over the other 1025; cache the 1026th, attend the 1 new token over the other 1026; ...). Not quite constant time per token, but much better than n^2.
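A toy sketch of the idea in plain Python, not how any real library implements it: `attend` and `KVCache` are hypothetical names, there is a single head, and the token vectors stand in directly for queries, keys, and values (a real model would first apply learned projections).

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector (single head)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


class KVCache:
    """Keeps keys/values from earlier steps so only the new token is attended."""

    def __init__(self):
        self.keys = []
        self.values = []
        self.dot_products = 0  # query-key scores computed so far

    def step(self, query, key, value):
        # Store the new token's key/value, then attend the single new query
        # over everything cached (including the new token itself).
        self.keys.append(key)
        self.values.append(value)
        self.dot_products += len(self.keys)
        return attend(query, self.keys, self.values)


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache()
for tok in tokens:
    out = cache.step(tok, tok, tok)

# The cached result for the last token matches attending over the full sequence.
full = attend(tokens[-1], tokens, tokens)
print(all(abs(a - b) < 1e-12 for a, b in zip(out, full)))  # prints: True
print(cache.dot_products)  # prints: 6  (1 + 2 + 3)
```

Each step does work proportional to the current length rather than its square, which is the "not quite constant" part.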

I would imagine there are other optimizations too, but this is the one I've heard of.

[1] fairly code-y, but links to some other posts at the start: https://www.dipkumar.dev/becoming-the-unbeatable/posts/gpt-k...


Tostino 1035 days ago

I don't believe this is correct from my direct experience running local models. They generate slower when you start to fill up the context window compared to when it first starts responding.


dgreensp 1035 days ago

My understanding (not an expert) is that the time for an LLM to produce an output is linear in the length of the output, but may not be in the length of the input (i.e. context). It may be quadratic in the context, or it may use some kind of fancy attention optimization.


Tostino 1035 days ago

Yeah... But for every new token you generate, you need to take that token into account, along with all previously generated tokens and the input provided by the user, when generating the next one.
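As a rough back-of-the-envelope version of that, assuming a prompt of p tokens, m generated tokens, and a cost of one unit per earlier token that the new token has to look at (the symbols p, m, and k are introduced here for illustration, and everything other than attention is ignored):

```latex
\text{cost of the } k\text{-th generated token} \;\propto\; p + k - 1,
\qquad
\sum_{k=1}^{m} (p + k - 1) \;=\; mp + \frac{m(m-1)}{2}
```

So under these assumptions the per-token cost grows linearly with how full the context is, and the total is linear in the output length only while m is small relative to p.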
