by famouswaffles 849 days ago
There's fixed compute per token, but more tokens means more compute, so an LLM technically has more "time" for a query when more tokens precede it.
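To put rough numbers on that, here is a minimal sketch. It assumes a constant cost per token, which is a simplification (it ignores that attention cost grows with context length). `COST_PER_TOKEN` and `compute_spent` are hypothetical names made up for illustration.

```python
# Hypothetical constant: cost of processing one token, in arbitrary units.
# Assumption: the cost is the same for every token (ignores attention growth).
COST_PER_TOKEN = 1.0


def compute_spent(num_tokens: int, cost_per_token: float = COST_PER_TOKEN) -> float:
    """Total compute spent on a sequence under the fixed-cost-per-token assumption."""
    return num_tokens * cost_per_token


# A terse prompt versus a prompt preceded by a long written-out reasoning trace.
terse = compute_spent(10)
verbose = compute_spent(200)
print(verbose / terse)  # ratio of "time" available to the two queries
```

Under this assumption the "time" available scales linearly with the number of preceding tokens, which is all the claim above needs.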
1 comment

A key aspect is the information bottleneck the mechanism enforces: the next "iteration" only gets access to the newly computed token and discards all the other information it computed along the way.

So if you want it to spend more "time" usefully without changing the architecture, you have to get it to write the temporary information down in the tokens, as "think step by step" does, or alternatively use iterative prompts: "write a draft for the rough structure", then "now rewrite it better with more detail".
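The iterative-prompt idea can be sketched roughly like this. `fake_model` is a hypothetical stand-in for whatever LLM call you actually use, not a real API, and `iterative_prompts` is just an illustrative name. The point is that the growing text is the only thing carried from one step to the next.

```python
from typing import Callable, List


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; returns a placeholder so the sketch runs."""
    return f"<reply to a {len(prompt)}-character prompt>"


def iterative_prompts(
    task: str,
    steps: List[str],
    model: Callable[[str], str] = fake_model,
) -> List[str]:
    """Run each instruction in turn, feeding all earlier text back into the prompt."""
    transcript = task
    outputs = []
    for step in steps:
        prompt = f"{transcript}\n\n{step}"
        reply = model(prompt)
        outputs.append(reply)
        # The reply is written into the transcript: this is the only
        # "memory" that the next step gets to see.
        transcript = f"{prompt}\n{reply}"
    return outputs


drafts = iterative_prompts(
    "Explain how a hash map works.",
    [
        "Write a draft for the rough structure.",
        "Now rewrite it better with more detail.",
    ],
)
print(len(drafts))
```

Swapping `fake_model` for a real call should leave the loop unchanged, since each step only depends on the text written so far.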

This blew my mind a little, as it feels unintuitive: you wouldn't just forget what you based your previous reply on, at least not after some practice with your mind and memory (which I need to catch up on, I must add).

It also feels like a multiplication of the required processing power, but I have no clue yet how one could use the previous generation of weights and the tokens themselves to improve, elaborate on, or widen the range of predicted potential results.