> These flags don't magically change LLM formalisms. You can read more about how floating point operations produce non-determinism here:

Basically what you're saying is "for 99.9% of use cases, and given how people actually use them, they are non-deterministic, and you have to work around that non-determinism very carefully, to the point of needing workarounds for your GPU that make them even more unusable".

> In this context, forcing single-threading bypasses FP-hardware's non-associativity issues that crop up with multi-threaded reduction.

Translation: yup, they are non-deterministic under normal conditions. Which the paper explicitly states:

--- start quote ---

existing LLM serving frameworks exhibit non-deterministic behavior: identical inputs can yield different outputs when system configurations (e.g., tensor parallel (TP) size, batch size) vary, even under greedy decoding. This arises from the non-associativity of floating-point arithmetic and inconsistent reduction orders across GPUs.

--- end quote ---

> If you still don't have bit-replicated outputs for the same input sequence, either something is seriously wrong with your computer or you should get in touch with a reputable metatheoretician because you've just discovered something very significant.

Basically what you're saying is: if you do all of the following, then the output will be deterministic:

- workaround for GPUs with num_thread 1
- temperature set to 0
- top_k to 0
- top_p to 0
- context window to 0 (or always do a single run from a new session)

Then the output will be the same all the time. Otherwise even "non-shitty corp runners" or whatever will keep giving different answers for the same question: https://gist.github.com/dmitriid/5eb0848c6b274bd8c5eb12e6633...

Edit: So what we should be saying is that "LLMs as they are normally used are very/completely non-deterministic".

> Perhaps in the future you can learn from this experience and start with a post like the first part of this

So why didn't you?
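
For what it's worth, the non-associativity the paper refers to is easy to reproduce on a plain CPU. Below is a minimal Python sketch, not taken from any serving framework: `naive_sum` and `chunked_sum` are hypothetical helpers that mimic a single-threaded reduction versus one split across two workers, and the input values are picked by hand to make the rounding visible.

```python
def naive_sum(values):
    """Left-to-right float accumulation, like a single-threaded reduction.

    Written as an explicit loop on purpose: the built-in sum() may use
    compensated summation for floats, which would hide the effect.
    """
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, chunks):
    """Hypothetical stand-in for a multi-worker reduction.

    Each chunk is summed on its own, then the partial sums are added.
    """
    size = (len(values) + chunks - 1) // chunks  # ceiling division
    partials = [naive_sum(values[i:i + size]) for i in range(0, len(values), size)]
    return naive_sum(partials)


# Same three numbers, different grouping, different result.
left = (0.1 + 0.2) + 0.3   # 0.6000000000000001
right = 0.1 + (0.2 + 0.3)  # 0.6

# Same four numbers, different reduction order, different result.
values = [1e16, 1.0, -1e16, 1.0]
one_worker = naive_sum(values)        # 1.0
two_workers = chunked_sum(values, 2)  # 0.0

print(left == right)            # False
print(one_worker, two_workers)  # 1.0 0.0
```

Nothing here is random: each ordering is perfectly repeatable on its own. The outputs differ only because the grouping of the additions changed, which is the same effect the quote attributes to varying TP size or batch size.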