	by maplethorpe 51 days ago
	Can someone explain why this is? Do LLMs somehow contain a true random number generator? Why wouldn't they produce the same outputs given the same inputs (including the same temperature)? edit: I'm not talking about an LLM as accessed through a provider. I'm just talking about using a model directly. Why wouldn't that be deterministic?

4 comments

anon373839 51 days ago

The model outputs a probability distribution for the next token, given the sequence of all previous tokens in the context window. It’s just a list of floats in the same order as the list of tokens that the tokenizer uses.

After that, a piece of software that is NOT the LLM chooses the next token. This is called the sampler. There are different sampling parameters and strategies available, but if you want repeatable* outputs, just take the token with the highest probability number.

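* Perfect determinism in this sense is difficult to achieve because GPU calculations have a small amount of nondeterminism: floating-point addition is not associative, and parallel kernels don't always add things up in the same order. But you can get very close.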
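
To make the "just take the token with the highest probability" option concrete, here is a minimal sketch of a greedy sampler. The vocabulary and the probabilities are made up for illustration, and `greedy_pick` is a hypothetical name, not a function from any real library.

```python
def greedy_pick(probs, vocab):
    """Return the token with the highest probability (first one wins on ties)."""
    best_index = max(range(len(probs)), key=lambda i: probs[i])
    return vocab[best_index]

# Hypothetical 4-token vocabulary and model output, for illustration only.
# probs[i] is the probability the model assigned to vocab[i].
vocab = ["the", "cat", "sat", "mat"]
probs = [0.10, 0.65, 0.20, 0.05]

print(greedy_pick(probs, vocab))  # prints "cat"
```

Nothing in this function is random, so the same list of floats always gives the same token.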

2ndorderthought 51 days ago

I'm not sold that an LLM is an LLM without a sampler, but it's not worth quibbling over. It's part of the statistical model anyway.

vrighter 51 days ago

The LLM is the trained part; the rest is the handwritten part. The sampler is handwritten, not learned.

2ndorderthought 50 days ago

Believe it or not, in statistics and machine learning the hard-coded parts of a model that impact the results are considered part of the model. But I understand that nowadays we don't care about these things because AI goes brrr.

nowittyusername 51 days ago

There are A LOT of misconceptions about LLMs; the biggest one is that they are not deterministic. They are 100% deterministic, and temperature has nothing to do with it. You WILL get exactly the same result every single time (at ANY temperature) as long as you use the same sampling parameters (seed included) and the same server config parameters. What causes variance in LLMs is server parameters like batch processing and caching, possibly among a few other things, with batching responsible for most of the issues. The reason batching is used is that large providers serve multiple customers per GPU, and breaking up the VRAM is tricky and causes drift. If you start llama.cpp, for example, with only one person per slot and batching off, you will get the same results every time, even at temperature 1.2 or whatever other parameters, because you are using one GPU per inference call, so no fucky business there. The reason most people are unaware of this is that most people only have experience with APIs instead of working with the actual inference engine itself, so this goddamned myth keeps spreading. My video for reference is here, where you can download and try it for yourself: https://www.youtube.com/watch?v=EyE5BrUut2o
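
A toy sketch of the claim above, in plain Python rather than a real inference engine: sampling at temperature 1.2 repeats exactly when the pseudo-random generator starts from the same seed. The logits, the seed value, and the function name `sample_tokens` are assumptions made up for illustration; this is not how llama.cpp or any other engine is implemented.

```python
import math
import random

def sample_tokens(logits, temperature, seed, n):
    """Draw n token indices from softmax(logits / temperature) with a seeded PRNG."""
    rng = random.Random(seed)  # all randomness comes from this seeded generator
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max before exp() for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    # random.choices accepts unnormalized relative weights.
    return rng.choices(range(len(logits)), weights=weights, k=n)

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical model output
run_a = sample_tokens(logits, temperature=1.2, seed=42, n=20)
run_b = sample_tokens(logits, temperature=1.2, seed=42, n=20)
print(run_a == run_b)  # True: same inputs and same seed give the same draws
```

Change the seed and the draws change; keep it fixed and the "random" choices are a pure function of the inputs.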

maplethorpe 51 days ago

Thanks so much for this! I still haven't got around to building my own language model yet, so I'm a bit fuzzy on the details, but if I imagined a thought experiment where I did all the math by hand on paper, I just couldn't see how I would end up with a different output each time given the same inputs. Finding out that the variance other people are seeing comes from the server/hardware stuff clears that up.

This is a surprisingly annoying question to Google. A lot of articles give the reason that softmax returns a probability distribution, as if the presence of the word "probability" means the tokens will be different every time.
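
For what it's worth, the softmax step itself is easy to check. This is a minimal sketch with made-up logits; `softmax` here is written by hand, not taken from any library.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max before exp() for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical model output
first = softmax(logits)
second = softmax(logits)
print(first == second)  # True: softmax is an ordinary function with no randomness
```

The output describes a probability distribution, but computing it involves no random draw. Randomness only enters if something later samples from it.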

evrydayhustling 51 days ago

An LLM itself -- that is, the weights and the mathematical functions linking them -- does not tell you exactly how to train from data, nor how to generate an output. Instead, it describes a function providing a relative likelihood(output | input).

Deciding how to pick a particular output given that likelihood function is left as an exercise for the user, which we call inference.

One obvious choice is to keep picking the highest likelihood token, feed it into the model, and get another -- on repeat. This is what most implementations call "temperature=0". But doing this for token after token can lead to boring output, or steer you into pathological low-probability sequences like an endless set of repeats.

So, the current SOTA is to intentionally introduce a random factor (temperature>0) to the sampling process -- along with other hacks, like explicit suppression of repeats.
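
As a rough sketch of what the temperature knob does to the distribution before anything is sampled (the logits are made up, and `softmax_with_temperature` is a hypothetical helper, not a library function):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max before exp() for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical model output
for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Low temperatures pile almost all of the probability onto the top token, and high temperatures flatten the distribution so that other tokens get picked more often. Note that temperature = 0 would divide by zero in this sketch, which is why it is commonly handled as a plain "pick the highest" special case.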

2ndorderthought 51 days ago

Yeah, sure. So temperature is baked into these LLMs, and when it isn't zero it increases the probability of taking a different path to decode the tokens, whether it's at a provider or downloaded on your own machine.

Technically, even when the temperature is 0 it's not guaranteed to be deterministic, just more likely to be. You can have ties in the probabilities for the next token, and floating-point noise is real.
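
A small sketch of both effects, using ordinary Python floats rather than a GPU. The numbers are chosen purely for illustration, and `argmax` is a hand-written helper.

```python
def argmax(values):
    """Index of the largest value (first index wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

# The same three numbers, added in a different order.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)  # False: floating-point addition is not associative

# Two candidate tokens with nearly identical scores.
print(argmax([0.6, a]))  # 1: a is a hair above 0.6
print(argmax([0.6, b]))  # 0: an exact tie, so the first index wins
```

A last-bit difference in how a score was summed is enough to flip which token "wins" when two candidates are this close, and everything generated after that token can then diverge.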

All these models are doing is guesstimating the next token to say.
