Hacker News new | ask | show | jobs
by chessgecko 780 days ago
Having played with this stuff its definitely spots in the expert buffers (the other comment in the thread has the link to explanation) and not the extremely small differences in floating point arithmetic. The effect from this is much much less than any change in quantization, i.e. almost impossible to see from the outputs.
1 comments

I guess the root cause of my claim is that OpenAI won't tell us whether or not GPT-3.5 is an MoE model, and I assumed it wasn't. Since GPT-3.5 is clearly nondeterministic at temp=0, I believed the nondeterminism was due to FPU stuff, and this effect was amplified with GPT-4's MoE. But if GPT-3.5 is also MoE then that's just wrong.

What makes this especially tricky is that small models are truly 100% deterministic at temp=0 because the relative likelihoods are too coarse for FPU issues to be a factor. I had thought 3.5 was big enough that some of its token probabilities were too fine-grained for the FPU. But that's probably wrong.

On the other hand, it's not just GPT, there are currently floating-point difficulties in vllm which significantly affect the determinism of any model run on it: https://github.com/vllm-project/vllm/issues/966 Note that a suggested fix is upcasting to float32. So it's possible that GPT-3.5 is using an especially low-precision float and introducing nondeterminism by saving money on compute costs.

Sadly I do not have the money[1] to actually run a test to falsify any of this. It seems like this would be a good little research project.

[1] Or the time, or the motivation :) But this stuff is expensive.