| HN Mirror



	by jmuguy 47 days ago
	Are you arguing that the output of an LLM isn’t random?

3 comments

mpyne 47 days ago

It is random if you select it to be (temperature != 0, etc.).

It is not random if you don't use random sampling in its output generation.

If the whole thing were actually stochastic, prompt caching would be impossible: a cache of what the previous tokens were transformed into, kept to speed up future generation, would be invalidated by the missing random state.

Look at llama.cpp, you can see what samplers are adjustable and if you use samplers that employ randomness you can see what settings disable the random sampling. Or you can employ randomness but fix the seed to get reproducible results.
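
A minimal sketch of the distinction above, in plain Python rather than llama.cpp's actual sampler code. The function name `sample_token` and the toy logits are made up for illustration; the only assumption is that temperature 0 is treated as greedy argmax and that any randomness comes from an explicitly seeded generator.

```python
import math
import random


def sample_token(logits, temperature, rng=None):
    """Pick a token index from raw logits (hypothetical helper, not a real API)."""
    if temperature == 0:
        # Greedy decoding: no random numbers are drawn at all.
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    # Unnormalized softmax weights; subtracting the peak avoids overflow.
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(logits, temperature, seed, steps=20):
    """Draw several tokens from one seeded generator."""
    rng = random.Random(seed)
    return [sample_token(logits, temperature, rng) for _ in range(steps)]


toy_logits = [1.0, 3.0, 2.0]
print(sample_token(toy_logits, 0))  # always index 1, the largest logit
print(generate(toy_logits, 1.0, seed=42) == generate(toy_logits, 1.0, seed=42))  # True
```

With a fixed seed the sampled sequence repeats exactly, which is the "employ randomness but fix the seed" case. This says nothing yet about whether the logits themselves are bit-for-bit stable across hardware.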

sumeno 47 days ago

Yes, it can still be random with temperature set to 0. It'll only be the same if you run it on exactly the same hardware every single time.

philipswood 47 days ago

An LLM is a set of structured matrix multiplies and function applications. The only potentially non-deterministic step is selecting the next token from the final output and that can be done deterministically.

jmalicki 47 days ago

Matrix multiplication on GPUs is non-deterministic. So are things like cumsum().

https://docs.pytorch.org/docs/2.11/generated/torch.use_deter...

This comes down to map reduce and floating point's lack of associativity. You see the same thing with OpenMP on CPUs.
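
The lack of associativity is easy to see with three ordinary doubles. This is just standard IEEE 754 arithmetic in Python, not anything GPU-specific:

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # group the first two terms
right = a + (b + c)  # group the last two terms

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```

A parallel reduction that combines partial results in a different order from run to run is effectively choosing between groupings like these.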

People are constantly claiming determinism in LLMs that is just not there.

zadikian 47 days ago

Even if it were reproducible, realistically most people are using some service like Claude that makes no guarantee that the model or hardware didn't change. Which is fine, it doesn't need reproducibility.

This is interesting though, I didn't know PyTorch had a debug mode for reproducibility.

jmalicki 46 days ago

Even with this debug mode, a different batch size can give different results for the same input - e.g. your tensor multiplies might use different blocking, hence different associativity.
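
A toy illustration of the blocking effect, assuming nothing about how any real GPU kernel tiles its work: `blocked_sum` is a hypothetical helper that sums fixed-size blocks first and then adds the partial sums. The input values are chosen to exaggerate the effect.

```python
def sequential_sum(values):
    # Explicit left-to-right loop. The builtin sum() is avoided on purpose,
    # since newer CPython versions may use compensated summation for floats.
    total = 0.0
    for value in values:
        total += value
    return total


def blocked_sum(values, block):
    # Sum each block independently, then add up the partial results.
    partials = [
        sequential_sum(values[start:start + block])
        for start in range(0, len(values), block)
    ]
    return sequential_sum(partials)


data = [1e100, 1.0, -1e100, 1.0]

print(blocked_sum(data, 4))  # 1.0 (one block, plain left-to-right sum)
print(blocked_sum(data, 2))  # 0.0 (each 1.0 is absorbed by a huge neighbour)
```

Same inputs, same arithmetic, each run individually deterministic, and yet the answer depends on the block size. Real workloads show far smaller differences than this contrived case.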

I posted that to show that at a bare minimum, there is some pretty extreme nondeterminism (though probably mild in effect) in even the most pedestrian GPU inference, unless you go to the extreme of using the debug mode and taking the potential performance hit.

vrighter 47 days ago

well just run all inference on the cpu, single threaded /s

8note 47 days ago

random isn't the right term.

ill-conditioned or unstable is better

a small change in the input can create a large difference in the output.
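
A rough sketch of that sensitivity, using made-up logits for two candidate tokens that are nearly tied. The numbers are purely illustrative and not taken from any model:

```python
def argmax(values):
    # Index of the largest entry (first one wins on exact ties).
    return max(range(len(values)), key=lambda i: values[i])


logits = [5.0, 5.0 + 1e-9]         # token 1 leads by a hair
nudged = [5.0 + 2e-9, 5.0 + 1e-9]  # token 0 shifted by 2e-9

print(argmax(logits))  # 1
print(argmax(nudged))  # 0
```

A perturbation on the order of rounding noise is enough to flip the chosen token, and every later token is then conditioned on a different prefix.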
