| Okay, this has been a ride. I reverted to ollama 0.8.0, switched the model to gemma3:12b-it-qat, set the seed to 42 and the temperature to 0, and used my old prompt. With that setup I got consistent results everywhere and could confirm from old screenshots that everything still matches (prompt and output: https://pastebin.com/xUi3bbGh). However, with the prompt I used earlier in this thread, I get a different response between machines, even with the temperature and seed pinned. On the same machine, the output initially looked reliably identical, but after running it a good few more times I was eventually able to reproduce the flip-flopping behavior you describe. API-wise, I wasn't able to get consistent results at all, so that was a complete bust. Ultimately, it seems I fooled myself in the past and accidentally cherry-picked an example. Or at least it's way more brittle than I thought. At this point I'd need significantly more insight into how the inference engine (ollama) works to say definitively whether this is a model trait or an engine trait, and whether it is essential for the model to work (although I'm still convinced it isn't). Not sure if that helps you much in practice though. I wouldn't make a good scientist, apparently :) |
I assume there are more levers we could try pulling to reduce variation? I'll be looking into this as well.
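For what it's worth, here is a rough sketch of how I'd measure the variation rather than eyeball it: send the same prompt many times with the seed and temperature pinned, then count how many distinct outputs come back. It assumes a local ollama server on the default port and the `/api/generate` endpoint with sampling settings passed in `options`; the helper names (`build_payload`, `ollama_generate`, `tally_outputs`) are just mine, not anything from ollama itself.

```python
import json
import urllib.request
from collections import Counter

# Assumption: ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt, model="gemma3:12b-it-qat", seed=42, temperature=0):
    """Build a request body with the sampling settings pinned."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON object instead of a stream
        "options": {"seed": seed, "temperature": temperature},
    }


def ollama_generate(prompt):
    """Send one non-streaming request and return the generated text."""
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


def tally_outputs(generate, prompt, runs=20):
    """Call `generate` repeatedly and count each distinct output."""
    return Counter(generate(prompt) for _ in range(runs))
```

Something like `tally_outputs(ollama_generate, "your prompt here", runs=50)` should then return a single entry if the output is truly stable, and more than one if it flip-flops. Passing the generate function in as an argument also makes it easy to point the same tally at a different machine or backend.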
As an aside, because of my own experience with variability using ChatGPT (non-API; I assume there are more levers to pull there too), I've been thinking about LLMs and their application to gaming. To what extent is it possible to use an LLM to interpret a result and return a variable that then triggers the usual state updates? This would hopefully add a bit of intentional variability to the game's response to user input while keeping the updates to internal game logic consistent.
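To make the idea a bit more concrete, here is a minimal sketch of what I have in mind. The model's reply is squeezed into a small fixed set of actions, and only that action touches the game state, so the wording can vary while the state update stays deterministic. Everything here (`Action`, `GameState`, the damage numbers) is made up for illustration, and the LLM call itself is left out; `parse_action` just takes whatever text the model returned.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ATTACK = "attack"
    HEAL = "heal"
    FLEE = "flee"
    NONE = "none"


@dataclass
class GameState:
    player_hp: int = 10
    enemy_hp: int = 10
    in_combat: bool = True


def parse_action(llm_text: str) -> Action:
    """Map model output to a known action; anything unexpected becomes NONE."""
    token = llm_text.strip().lower()
    for action in Action:
        if token == action.value:
            return action
    return Action.NONE


def apply_action(state: GameState, action: Action) -> GameState:
    """Deterministic game logic: the same action always has the same effect."""
    if action is Action.ATTACK:
        state.enemy_hp = max(0, state.enemy_hp - 3)
    elif action is Action.HEAL:
        state.player_hp = min(10, state.player_hp + 2)
    elif action is Action.FLEE:
        state.in_combat = False
    return state
```

The fallback to `Action.NONE` is the important part: if the model drifts or returns something off-script, the game state simply doesn't change instead of ending up somewhere inconsistent.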
edit: found this! https://github.com/rasbt/LLMs-from-scratch/issues/249 Judging by various other links I've found, it seems to be an ongoing issue. And now when I google "ollama reproducibility", this thread comes up on the first page, so it seems it's an uncommon issue as well :(