


	by earslap 816 days ago
	The autoregressive transformer architecture has a constant cost per token, no matter how hard the task is. You can ask the most complicated reasoning question, and generating the next token takes the same amount of computation as it does for the simplest yes / no question. This is due to architectural constraints. Letting the LLM generate "scratch" data to compute on (and attend to as relevant information) is a way of circumventing the constant cost limitation. The harder the task, the more "scratch" you need, so that more relevant context is available for future tokens.

1 comments

visarga 816 days ago

That's flatly wrong. Each successive token costs progressively more. The deeper a token is in the sequence, the more past states it has to attend to. As evidence, just remember how slow it gets when the context is large, and how snappy it is when you first start a chat.
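
A rough way to see the growth described above is to count the attention work for one new token. This is only a back-of-the-envelope sketch: the function name and the parameters `d_model` and `n_layers` are illustrative assumptions, a KV cache is assumed, and the per-token projection and MLP costs (which do not depend on position) are left out.

```python
def attention_macs_per_token(position: int, d_model: int, n_layers: int) -> int:
    """Approximate multiply-accumulates spent in attention for one new token.

    Assumes a KV cache, so only the newest token's query is evaluated.
    Per layer the query is dotted with `position` cached keys
    (position * d_model MACs) and the resulting weights are used to sum
    `position` cached values (another position * d_model MACs).
    """
    if position < 1:
        raise ValueError("position is 1-based and must be >= 1")
    return n_layers * 2 * position * d_model
```

Under this model, doubling `position` doubles the attention cost of the next token, which matches the observation that long chats slow down.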


earslap 816 days ago

The way I worded it, it might seem wrong - and I agree with you. When I said "constant" I meant without any optimizations to speed up shorter contexts, so with full designed context, architecturally, it is constant. You can pad shorter active contexts with zeroes and avoid attending to empty spaces as an optimization, but that is just an optimization, not an architectural property. If you want "more computation" you fill the context with relevant data (chain of thought, or n-shot stuff), which is the "trick" Karpathy alluded to (it provides more context to attend to), and I agree with that analysis.


shawntan 816 days ago

You're both kinda right. The type of computation that happens for that attention step that you refer to is parallel. I would say the thing that is "constant" is the computation graph depth (the number of sequential computations) which is actually important in computing certain functions.

https://blog.wtf.sg/posts/2023-02-03-the-new-xor-problem/
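
The distinction between sequential depth and total work can be made concrete with a toy count. This is a simplified sketch, not a model of any real implementation: both function names are made up for illustration, each layer is counted as one sequential step, and everything inside a layer (including attention over the whole context) is treated as parallel.

```python
def sequential_steps(n_layers: int, n_generated: int) -> int:
    """Longest chain of dependent layer evaluations.

    Each generated token must pass through all layers, and each new token
    depends on the previous one, so the chain grows with the number of
    generated tokens but not with the length of the context.
    """
    return n_layers * n_generated


def attention_scores_computed(n_layers: int, context_len: int, n_generated: int) -> int:
    """Total query-key scores computed while generating `n_generated` tokens.

    Generation step t (0-based) attends over context_len + t positions in
    every layer. These scores are independent of one another within a layer.
    """
    total = 0
    for t in range(n_generated):
        total += n_layers * (context_len + t)
    return total
```

With these assumptions, a longer prompt raises `attention_scores_computed` but leaves `sequential_steps` unchanged, while generating extra "scratch" tokens raises both.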


visarga 816 days ago

> The type of computation that happens for that attention step that you refer to is parallel

Flash attention, which is widely used, is no longer parallel. The attention matrix is solved batch by batch.
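
For readers unfamiliar with the block-by-block idea, here is a minimal sketch of a running ("online") softmax for a single query, which is the kind of bookkeeping tiled attention kernels rely on. It is not the FlashAttention kernel itself: values are plain scalars instead of vectors, the function names are illustrative, and nothing here says how blocks are scheduled on real hardware.

```python
import math


def attention_full(scores, values):
    """Reference: softmax over all scores at once, then a weighted sum."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def attention_blockwise(scores, values, block_size):
    """Same result, visiting the scores one block at a time.

    Keeps a running maximum, a running softmax denominator, and a running
    unnormalised output, rescaling the latter two whenever the maximum grows.
    """
    running_max = float("-inf")
    running_sum = 0.0
    running_out = 0.0
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        new_max = max(running_max, max(s_blk))
        # Rescale what was accumulated under the old maximum (0.0 on the first block).
        scale = math.exp(running_max - new_max)
        blk_weights = [math.exp(s - new_max) for s in s_blk]
        running_sum = running_sum * scale + sum(blk_weights)
        running_out = running_out * scale + sum(w * v for w, v in zip(blk_weights, v_blk))
        running_max = new_max
    return running_out / running_sum
```

For any block size, `attention_blockwise` should agree with `attention_full` up to floating-point rounding, so the full attention matrix never has to be held at once.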
