by perching_aix 348 days ago
Hmm, that's pretty strange. Not sure what might be going wrong, could be terminal shenanigans or a genuine bug in ollama.

Of course, double-checking the basics is worth doing first: ollama --version should return ollama 0.9.3, and the prompt should be copied and pasted so that it matches byte for byte.

Maybe you could also try querying the model through its API (localhost:11434/api/generate)? I'll ask a colleague to try and repro on his Mac like last time just to double check. I also tried restarting the model a few times, worked as expected here.

*Update:* getting some serious reproducibility issues across different hardware. A month ago the same experiment with regular quantized gemma3 worked fine between GPU, CPU, and my colleague's Mac. This time the responses differ everywhere (although they are consistent between resets on the same hardware). Seems like this model may be more sensitive to hardware differences? I can try generating you a response with regular gemma3 12b qat if you're interested in comparing that.


Yeah trying out gemma3 12b qat sounds great.

I got back 0.9.3 as well, and I copied and pasted the prompt (tried it both with and without quotes, just in case...)

I can try the API as well. I'm using a Legion 15ACH6, but I could also try on my MacBook Pro.

Okay, this has been a ride.

Reverted to 0.8.0 of ollama, switched to gemma3:12b-it-qat for the model, set the seed to 42 and the temp to 0, and used my old prompt. This way I was able to get consistent results everywhere, and could confirm from old screenshots everything still matches.

Prompt and output here: https://pastebin.com/xUi3bbGh
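For anyone who wants to try the same pinned settings over the API instead of the CLI, here is a minimal sketch. It assumes a local ollama server on the default port and uses the /api/generate endpoint mentioned earlier in the thread, with the seed and temperature passed through the request's "options" field. The helper names (build_request, generate) are made up for illustration, and this is not necessarily how the results above were produced.

```python
import json
import urllib.request

# Assumption: ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="gemma3:12b-it-qat", seed=42, temperature=0.0):
    """Build a POST request with the sampling settings pinned (hypothetical helper)."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON object instead of a token stream
        "options": {"seed": seed, "temperature": temperature},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return only the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["response"]
```

Calling generate("your prompt here") twice on the same machine and diffing the two strings is the quickest sanity check. Pinning seed and temperature removes sampling randomness, but as the rest of the thread shows, it does not by itself guarantee identical output.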

However, when using the prompt I used previously in this thread, I'm getting a different response between machines, even with the temp and seed pinned. On the same machine, I initially found that it's reliably the same, but after running it a good few times more, I was eventually able to get the flip-flopping behavior you describe.

API-wise, I just straight up wasn't able to get consistent results at all, so that was a complete bust.
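Since the flip-flopping only showed up after a good number of runs, eyeballing two or three outputs is easy to be fooled by. A rough way to quantify it is to run the same prompt many times and count distinct outputs. The sketch below assumes you supply some generate(prompt) callable that returns the response text (CLI wrapper, API call, whatever); tally_runs and fingerprint are hypothetical names, not part of ollama.

```python
import hashlib
from collections import Counter


def fingerprint(text):
    """Short hash of the exact bytes of a response, so any difference shows up."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def tally_runs(generate, prompt, runs=20):
    """Call generate(prompt) `runs` times and count how often each distinct output occurs."""
    counts = Counter()
    for _ in range(runs):
        counts[fingerprint(generate(prompt))] += 1
    return counts


if __name__ == "__main__":
    # Stand-in for a real model call: alternates between two answers.
    replies = iter(["answer A", "answer B"] * 10)
    counts = tally_runs(lambda _prompt: next(replies), "same prompt", runs=20)
    print(len(counts), "distinct outputs:", dict(counts))
```

One key in the result means every run matched byte for byte; more than one means the output is flipping, and the counts show how often each variant appears.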

Ultimately, it seems like I successfully fooled myself in the past and accidentally cherry picked an example? Or at least it's way more brittle than I thought. At this point I'd need significantly more insight into how the inference engine (ollama) works to be able to definitively ascertain whether this is a model or an engine trait, and whether it is essential for the model to work (although I'm still convinced it isn't). Not sure if that helps you much in practice though.

I wouldn't make a good scientist, apparently :)

I appreciate the effort! And I disagree - that's what it's all about haha

I assume there are more levers we could try pulling to reduce variation? I'll be looking into this as well.

As an aside, because of my own experience with variability using ChatGPT (non-API, I assume there are also more levers to pull here), I've been thinking about LLMs and their application to gaming. To what extent is it possible to use an LLM to interpret a result and then return a variable that triggers the usual state updates? This would hopefully add a bit of intentional variability in the game's response to user inputs while keeping the internal game logic updates consistent.
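To make the idea a bit more concrete, one way this could look is sketched below: the model's free-form reply is reduced to one value from a small fixed set, and only that value drives ordinary, deterministic game code. Everything here (Action, GameState, the damage value) is a made-up toy, and it assumes the model has been asked to answer with a single action word.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    ATTACK = "attack"
    FLEE = "flee"
    TALK = "talk"


@dataclass
class GameState:
    player_hp: int = 10
    enemy_hp: int = 10
    in_combat: bool = True


def parse_action(llm_reply: str) -> Optional[Action]:
    """Map the model's reply to a known Action, or None if it is anything else."""
    token = llm_reply.strip().lower().strip(".!\"' ")
    for action in Action:
        if action.value == token:
            return action
    return None


def apply_action(state: GameState, action: Optional[Action]) -> GameState:
    """Deterministic state update; the LLM never touches the state directly."""
    if action is Action.ATTACK:
        state.enemy_hp = max(0, state.enemy_hp - 3)
    elif action is Action.FLEE:
        state.in_combat = False
    # TALK and unrecognized replies (None) leave the state unchanged.
    return state
```

The model can vary its wording or flavor text as much as it likes, but the state can only change through the handful of branches in apply_action, and anything unparseable is treated as a no-op.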

edit: found this! https://github.com/rasbt/LLMs-from-scratch/issues/249 Seems that it's an ongoing issue from various other links I've found, and now when I google "ollama reproducibility" this thread comes up on the first page, so it seems it's an uncommon issue as well :(