by perching_aix 350 days ago
> so I can't just store my prompts in git and know that I'll get the same behavior each time.

Yes you can, albeit it's pretty silly to do so. LLMs are not (inherently) nondeterministic, you're just not pinning the seed, or are using a remote, managed service for them with no reliability and consistency guarantees. [0]

Here, experiment with me [1]:

- download Ollama 0.9.3 (current latest)

- download the model gemma3n e4b (digest: 15cb39fd9394) using the command "ollama run gemma3n:e4b"

- pin the seed to some constant; let's use 42 as an example, by issuing the following command in the ollama interactive session started in the previous step: "/set parameter seed 42"

- prompt the model: "were the original macs really exactly 9 inch?"

It will respond with:

> You're right to question that! The original Macintosh (released in 1984) *was not exactly 9 inches*. It was marketed as having a *9-inch CRT display*, but the actual usable screen size was a bit smaller. (...)

Full response here: https://pastebin.com/PvFc4yH7

The response should be the same over time, across all devices, regardless of whether GPU-acceleration is available.

Bit of an aside, but the overall sentiment echoed in the article reminds me of how visual programming was going to revolutionize everything and take programmers' jobs. The exception being that I find AI actually useful, and was able to integrate it into my workflow.

[0] All of this is to say, to the extent LLM nondeterminism is currently a model trait, it is supplied by a PRNG. Actual nondeterminism is typically an inference engine trait at most; see e.g. batched inference.

[1] Details are for experiment reproduction purposes. You can substitute the listed inference engine, model, seed, and prompt with whatever you prefer for your own set of experiments.
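
To illustrate the point in footnote [0]: when token sampling is driven by a PRNG, fixing the seed fixes the sequence of draws. Below is a minimal sketch of that idea only. The four-token vocabulary and the logits are made up for illustration and do not come from any real model, and a real inference engine does far more than this.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_tokens(logits, vocab, n, seed):
    """Draw n tokens from one fixed distribution using a PRNG seeded with `seed`."""
    rng = random.Random(seed)  # private generator, independent of global state
    probs = softmax(logits)
    return [rng.choices(vocab, weights=probs, k=1)[0] for _ in range(n)]

# Hypothetical vocabulary and logits, for illustration only.
VOCAB = ["yes", "no", "maybe", "unsure"]
LOGITS = [2.0, 1.5, 0.3, -1.0]

if __name__ == "__main__":
    first = sample_tokens(LOGITS, VOCAB, 10, seed=42)
    second = sample_tokens(LOGITS, VOCAB, 10, seed=42)
    print(first == second)  # same seed, same sequence of draws
```

The draws look random, but two runs with the same seed produce the same list. Whether a full engine preserves that property end to end is a separate question, which is what the replies below run into.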


I tried recreating this - gemma3n:e4b ID 15cb39fd9394 and /set parameter seed 42 and I got 2 different results on the two times I asked 'were the original macs really exactly 9 inch?'

Strangely, I'm only getting 2 alternating results every time I restart the model. I was not able to get the same result as you and certainly not with links to external sources. Is there anything else I could do to try to replicate your result?

I've only used ChatGPT prior and it'd be nice to use locally run models with consistent results.

Hmm, that's pretty strange. Not sure what might be going wrong, could be terminal shenanigans or a genuine bug in ollama.

Of course, double checking the basics would be a good thing to cover: ollama --version should return ollama 0.9.3, and the prompt should be copied and pasted to ensure it's byte-exactly matching.

Maybe you could also try querying the model through its API (localhost:11434/api/generate)? I'll ask a colleague to try and repro on his Mac like last time just to double check. I also tried restarting the model a few times, worked as expected here.
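
For anyone wanting to try the API route, here is a rough sketch of a request against the endpoint mentioned above. The field names (`stream`, `options`, `seed`, `temperature`) and the `response` key in the reply reflect my reading of Ollama's API documentation and should be checked against the version you run. The helper names `build_payload` and `generate` are made up for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # endpoint mentioned above

def build_payload(model, prompt, seed, temperature=None):
    """Build the JSON body for a single, non-streamed generation request."""
    options = {"seed": seed}
    if temperature is not None:
        options["temperature"] = temperature
    return {"model": model, "prompt": prompt, "stream": False, "options": options}

def generate(payload, url=OLLAMA_URL, timeout=300):
    """POST the payload to a locally running Ollama and return the generated text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        # Assumption: a non-streamed reply is one JSON object with a "response" field.
        return json.loads(reply.read().decode("utf-8"))["response"]
```

With Ollama running locally, `print(generate(build_payload("gemma3n:e4b", "were the original macs really exactly 9 inch?", seed=42)))` would send the same prompt and seed used in this thread. Nothing here guarantees identical output across runs; it only removes the terminal from the equation.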

*Update:* getting some serious reproducibility issues across different hardware. A month ago the same experiment with regular quantized gemma3 worked fine between GPU, CPU, and my colleague's Mac, this time the responses differ everywhere (although they are consistent between resets on the same hw). Seems like this model may be more sensitive to hardware differences? I can try generating you a response with regular gemma3 12b qat if you're interested in comparing that.

Yeah trying out gemma3 12b qat sounds great.

I got back 0.9.3 as well, and I copied and pasted the prompt (both with and without quotes, just in case...)

I can try the API as well. I'm using a Legion 15ACH6, but I could also try on my MacBook Pro.

Okay, this has been a ride.

Reverted to 0.8.0 of ollama, switched to gemma3:12b-it-qat for the model, set the seed to 42 and the temp to 0, and used my old prompt. This way I was able to get consistent results everywhere, and could confirm from old screenshots everything still matches.

Prompt and output here: https://pastebin.com/xUi3bbGh

However, when using the prompt I used previously in this thread, I'm getting a different response between machines, even with the temp and seed pinned. On the same machine, I initially found that it's reliably the same, but after running it a good few times more, I was eventually able to get the flip-flopping behavior you describe.

API wise, I just straight up wasn't able to get consistent results at all, so that was a complete bust.
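
One way to make "consistent" versus "flip-flopping" measurable is to fingerprint every response and count the distinct fingerprints. The sketch below is only a measuring aid under that assumption. `make_flip_flop` is a hypothetical stand-in for a real model call, used so the example runs without any engine installed.

```python
import hashlib
import itertools
from collections import Counter

def fingerprint(text):
    """Short SHA-256 prefix of the exact bytes, so any difference shows up."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]

def tally_runs(generate_once, runs):
    """Call generate_once() `runs` times and count each distinct output."""
    return Counter(fingerprint(generate_once()) for _ in range(runs))

def is_reproducible(tally):
    """True only if every run produced byte-identical output."""
    return len(tally) == 1

def make_flip_flop(outputs):
    """Hypothetical stand-in for a model call: cycles through fixed outputs."""
    cycle = itertools.cycle(outputs)
    return lambda: next(cycle)

if __name__ == "__main__":
    tally = tally_runs(make_flip_flop(["answer one", "answer two"]), runs=6)
    for digest, count in tally.items():
        print(digest, count)
    print("reproducible:", is_reproducible(tally))
```

Swapping the stand-in for a real call (terminal or API) gives a count of how many distinct responses a given setup produces over many runs, which is more telling than comparing two runs by eye.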

Ultimately, it seems like I successfully fooled myself in the past and accidentally cherry picked an example? Or at least it's way more brittle than I thought. At this point I'd need significantly more insight into how the inference engine (ollama) works to be able to definitively ascertain whether this is a model or an engine trait, and whether it is essential for the model to work (although I'm still convinced it isn't). Not sure if that helps you much in practice though.

I wouldn't make a good scientist, apparently :)

I appreciate the effort! And I disagree - that's what it's all about haha

I assume there are more levers we could try pulling to reduce variation? I'll be looking into this as well.

As an aside, because of my own experience with variability using ChatGPT (non-API, I assume there are also more levers to pull here), I've been thinking about LLMs and their application to gaming. To what extent is it possible to use LLMs to interpret a result and then return a variable that then executes the usual state updates? This would hopefully add a bit of intentional variability in the game's response to user inputs but consistency in updating internal game logic.
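
A rough sketch of what that split could look like: the model's free-form reply is reduced to one of a few allowed values, and only that value touches the game state. Everything here (the `Intent` values, the `GameState` fields, the damage numbers) is invented for illustration, and the model call itself is left out.

```python
from dataclasses import dataclass
from enum import Enum

class Intent(Enum):
    ATTACK = "attack"
    HEAL = "heal"
    FLEE = "flee"

@dataclass
class GameState:
    player_hp: int = 10
    enemy_hp: int = 10
    fled: bool = False

def parse_intent(model_reply):
    """Map the model's raw reply onto a known Intent, or None if it doesn't fit."""
    cleaned = model_reply.strip().lower()
    for intent in Intent:
        if cleaned == intent.value:
            return intent
    return None

def apply_intent(state, intent):
    """Deterministic rules: the same state and intent always give the same result."""
    if intent is Intent.ATTACK:
        state.enemy_hp = max(0, state.enemy_hp - 3)
    elif intent is Intent.HEAL:
        state.player_hp = min(10, state.player_hp + 2)
    elif intent is Intent.FLEE:
        state.fled = True
    return state

def step(state, model_reply):
    """Leave the state untouched when the reply is not a valid intent."""
    intent = parse_intent(model_reply)
    return state if intent is None else apply_intent(state, intent)
```

The variability stays in how the model reads the player's input. The state update is ordinary code, so an unexpected reply can at worst be ignored rather than corrupt the game logic.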

edit: found this! https://github.com/rasbt/LLMs-from-scratch/issues/249 Seems that it's an ongoing issue from various other links I've found, and now when I google "ollama reproducibility" this thread comes up on the first page, so it seems it's an uncommon issue as well :(

It's only deterministic with that exact model and that exact seed. So you are SOL unless you have a copy of that model.

And that exact prompt and that exact inference engine [version].

Pretty reasonable if you ask me. All of this was to say, these are still programs, all the regular sensibilities still apply. Heck, even for that, the sensibilities that apply are pretty old-school: in modern, threaded applications, you'd expect runtime behavioral variations. Not the case here. Even for the high-level language compilers referred to in the article, this doesn't apply so easily. The folks over at reproducible builds [0] put in a decent bit of effort to my knowledge to make it happen.

The overarching point being that it's not magic: it's technology. And if you hold them to even broadly the same standards you hold compilers to, they are absolutely deterministic.

In case you mean that if you pick anything else other than what I picked here, the process ceases to be deterministic, that is not true. You can trivially test and confirm that the same way I did.

[0] https://reproducible-builds.org/

This is really brittle, though, and in a way that matters a lot more than reproducible builds. Like, sure: a slightly different compiler might cause a slightly different bug to be introduced into the code, but here a slight bit of non-determinism leads to a drastically and fundamentally different result! And not only is the exact inference engine version important, but sometimes the exact hardware as well (when run on GPU)...

My pet peeve here is mistaking nondeterminism for model unreliability. The issue is not that these models or inference engines are inherently nondeterministic, they aren't, but that the former are just plain unreliable.

I can ask the example question multiple times, and often it will tell me absolute baloney. This is not the model being nondeterministic, it's the model being not very good. It's inconsistent across what we consider semantically equivalent prompts. This is a gap in alignment, not some sort of computational property, like nondeterminism. If you ask it 4+4 in multiple different ways, it will always(?) reply correctly, but ask it something more complex, and you can get seriously different results.

The same exchange can be played out again and again, and it will reply with the exact same misunderstandings in the exact same order, matching byte for byte. The kicker here is batched inference being a necessity for economic reasons at scale, and exactly matching outputs being impractical because of their non-formal target usage. So you get these wishy-washy back and forths about whether the models are deterministic or not, which is super frustrating and misses the point.
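
One commonly cited mechanism for why batching or different hardware can change results is that floating-point addition is not associative, so summing the same numbers in a different order can give a slightly different total. Whether that is what happened in the experiments in this thread is an assumption, not something established here. The sketch below uses deliberately extreme, made-up values so the effect is visible, and a made-up competing score of 0.5 to show how a small numeric difference can flip a greedy token choice.

```python
def sum_left_to_right(values):
    """Sequential reduction, one element at a time."""
    total = 0.0
    for v in values:
        total += v
    return total

def sum_pairwise(values):
    """Tree-style reduction over a non-empty list, as a parallel kernel might do."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return sum_pairwise(values[:mid]) + sum_pairwise(values[mid:])

def pick_token(score_a, score_b):
    """Greedy choice between two hypothetical tokens."""
    return "A" if score_a > score_b else "B"

# Made-up values, chosen so rounding is easy to see.
VALUES = [1e16, 1.0, -1e16, 1.0]

if __name__ == "__main__":
    sequential = sum_left_to_right(VALUES)  # 1.0
    tree = sum_pairwise(VALUES)             # 0.0
    print(sequential, tree)
    print(pick_token(sequential, 0.5), pick_token(tree, 0.5))  # A B
```

Both functions are fully deterministic. The outputs differ only because the order of operations differs, which is the sense in which the execution environment, rather than the model, can introduce the variation.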

And I have to keep dodging bullets like "well if you swap the hardware from under it it might change" or "if you prompt a remote service where they control what's deployed and don't tell you fully, and batch your prompt together with a bunch of other people's which you have no control over, it will not be reproducible". Yes, it might not be, obviously. Even completely deterministic code can become nondeterministic if you run it on top of an execution environment that makes it so. That's not exactly a reasonable counter I don't think.

With regular software, there's a contract (the specifications) that code is developed against, and variances are constrained within it. This is not the case with LLMs: the variances are "letting Jesus take the wheel" tier, and that's a perfectly fine retort. But then it's not nondeterminism that's the issue.

> Pretty reasonable if you ask me.

Where can I download ChatGPT models?

> The folks over at reproducible builds

I like those folks (hello hboeck!), but their work is unrelated to deterministic LLM output, so why even bring them up here?

> The overarching point being that it's not magic: it's technology.

Yea are you even responding to me, or is this just a stream of thought?

No one said LLMs are magic.

> Where can I download ChatGPT models?

Nowhere. Does that make LLMs nondeterministic?

> I like those folks (hello hboeck!), but their work is unrelated to deterministic LLM output, so why even bring them up here?

Did you read the blogpost in the OP?

I can't understand what I wrote on your behalf. What could possibly be unclear about this? Or was this another rhetorical question like the previous one, to polish your snark?

> Yea are you even responding to me, or is this just a stream of thought?

Yes, I was. Are you able to ask non-rhetorical questions too? Should I have asked for your blessed permission to write more than just a direct address of what you asked about?

>> Where can I download ChatGPT models?

> Nowhere. Does that make LLMs nondeterministic?

It does, yes.

I cannot reproduce the same output without access to the model.

Hence, not deterministic.

I see, that's a very entertaining way of reasoning about the determinism of an entire class of software, thanks for sharing.