by PeterisP 849 days ago
No, the standard LLM implementations currently in use apply a fixed amount of computation during inference, which is chosen and "baked in" by the model architecture before training. They don't really have the option to "think a bit more" before giving the answer; generating each token performs exactly the same number of matrix multiplications. They could probably be modified to do so in theory, but we don't do that properly yet, even if some styles of prompts, e.g. "let's think step by step", kind of nudge the model in that direction.
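
To make the fixed-compute point concrete, here is a minimal toy sketch. It is not a real LLM, and `make_layers` and `forward` are made-up names for illustration. It pushes one input vector through a fixed stack of layers and counts the multiply-adds actually executed; the count depends only on the layer shapes, not on what the input is.

```python
import math
import random

def make_layers(d_model, n_layers, seed=0):
    """Build n_layers random square weight matrices (hypothetical toy 'architecture')."""
    rng = random.Random(seed)
    return [[[rng.gauss(0, 1) / math.sqrt(d_model) for _ in range(d_model)]
             for _ in range(d_model)]
            for _ in range(n_layers)]

def forward(x, layers):
    """Run one vector through every layer; return the output and the multiply-add count."""
    macs = 0
    for w in layers:
        out = []
        for row in w:
            acc = 0.0
            for wi, xi in zip(row, x):
                acc += wi * xi
                macs += 1  # count every multiply-add that is performed
            out.append(math.tanh(acc))
        x = out
    return x, macs

layers = make_layers(d_model=16, n_layers=4)
_, easy = forward([0.0] * 16, layers)                    # a "trivial" input
rng = random.Random(1)
_, hard = forward([rng.gauss(0, 1) for _ in range(16)], layers)  # an arbitrary input
print(easy, hard)  # 1024 1024
```

Both calls do 4 layers x 16 x 16 = 1024 multiply-adds. Faster hardware would only finish those same operations sooner.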

The same model will give the same result, and more processing power will simply enable you to get the inference done faster.

On the other hand, more resources may enable (or be required for) a different, better model.

2 comments

> the same model will give the same result

Is it wrong to think of this as misleading? Don't the results for exactly the same request differ, because the same computed weights can produce multiple output strings?

Or do you include "multiple ways to phrase the same thing" in "same results" and I'm being a noob?

There is some intentional randomness in how the tokens are selected, and some unintentional randomness because certain optimizations are allowed to cause small side effects. In any case, in that sentence I didn't really intend to talk about the result being identical, but rather about the result not being any better just because more compute was available: by default, that extra capacity simply wouldn't get used in any way other than for a speedup.
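
A rough sketch of the intentional part of that randomness, assuming plain temperature sampling over a made-up list of three logits (`sample_token` is a hypothetical helper, not any library's API). With temperature 0 the pick is always the top-scoring token; with temperature 1 the same scores yield different tokens from run to run.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick a token index: greedy argmax at temperature 0, softmax sampling otherwise."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp((l - m) / temperature) for l in logits]
    # random.choices normalizes the weights itself
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [2.0, 1.5, 0.3]  # illustrative scores for three candidate tokens
rng = random.Random(0)

greedy = {sample_token(logits, 0, rng) for _ in range(20)}
sampled = {sample_token(logits, 1.0, rng) for _ in range(200)}

print(greedy)             # {0}: always the same token
print(len(sampled) > 1)   # True: several different tokens get picked
```

In both cases the scores come from identical weights and identical compute; only the selection step differs.
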
There's fixed compute per token, but more tokens = more compute, so an LLM will technically have more "time" for a query with more tokens preceding it.
A key aspect is the information bottleneck enforced by the mechanism: the next "iteration" only gets to access the newly computed token and discards all the other information it computed.

So if you want it to spend more "time" in a useful manner without changing the architecture, you have to get it to write down the temporary information in the tokens, as "think step by step" does, or alternatively use iterative prompts: "write a draft for the rough structure", then "now rewrite it better with more detail".
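
A toy sketch of that idea, under loud assumptions: `step` stands in for one fixed-cost forward pass that can do at most one addition, and the token layout (numbers, then "=", then partial sums) is a convention invented for this example. The only way for the toy to add up a longer list is to write its partial sums into the token stream and read them back on the next step.

```python
def step(tokens):
    """One fixed-cost 'pass': read the tokens, do at most one addition, emit one token."""
    sep = tokens.index("=")
    numbers = tokens[:sep]        # the question
    partials = tokens[sep + 1:]   # scratch work written by earlier steps
    k = len(partials)             # how many numbers are already folded in
    if k == 0:
        return numbers[0]
    if k < len(numbers):
        return partials[-1] + numbers[k]  # the single addition allowed per step
    return "<end>"

def generate(prompt):
    """Call step repeatedly, appending each emitted token, until it emits <end>."""
    tokens = list(prompt)
    steps = 0
    while True:
        tok = step(tokens)
        steps += 1
        if tok == "<end>":
            return tokens, steps
        tokens.append(tok)

tokens, steps = generate([3, 4, 5, "="])
print(tokens)  # [3, 4, 5, '=', 3, 7, 12]
print(steps)   # 4
```

Nothing is carried between calls to `step` except the token list, so the intermediate sums 3 and 7 have to be "written down", and a longer list simply costs more steps.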

This blew my mind a little, as it feels unintuitive: you wouldn't just forget what you based your previous reply on, at least not after some practice with your mind and memory (which I need to catch up on, I must add).

It also feels like a multiplication of required processing power, but I have no clue yet how one could use the previous generation of weights and the tokens themselves to improve, elaborate on, or widen the range of predicted potential results.