| HN Mirror

by ukuina 781 days ago

This is the most applicable part of the article:

Strategies to improve LLM accuracy:

Retry: We repeatedly invoke a model with the temperature set to zero, up to five times, if it fails the test cases provided with the problem description. Retrying makes sense because LLMs aren’t deterministic even at temperature zero.

Warming: This is the same as the retry strategy, but we gradually increase the temperature of the underlying model with each run, from 0 to 0.5. This increases the stochasticity of the model and, we hope, increases the likelihood that at least one of the retries will succeed.

Escalation: We start with a cheap model (Llama-3 8B) and escalate to more expensive models (GPT-3.5, Llama-3 70B, GPT-4) if we encounter a test case failure.
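
The three strategies can be sketched as plain loops. This is a minimal sketch, not the article's code: `generate` and `passes_tests` are hypothetical callables standing in for a model call and the problem's test harness, the linear temperature schedule is one assumed reading of "gradually increase", and running escalation at temperature 0 is also an assumption.

```python
from typing import Callable, Optional, Sequence

# Hypothetical callables (assumptions, not from the article):
#   generate(model, prompt, temperature) -> candidate solution text
#   passes_tests(solution) -> True if it passes the problem's test cases
Generate = Callable[[str, str, float], str]
PassesTests = Callable[[str], bool]


def retry(generate: Generate, passes_tests: PassesTests,
          model: str, prompt: str, attempts: int = 5) -> Optional[str]:
    """Retry: same model, temperature 0, up to `attempts` calls."""
    for _ in range(attempts):
        solution = generate(model, prompt, 0.0)
        if passes_tests(solution):
            return solution
    return None


def warming(generate: Generate, passes_tests: PassesTests,
            model: str, prompt: str, attempts: int = 5,
            max_temperature: float = 0.5) -> Optional[str]:
    """Warming: like retry, but step the temperature from 0 up to max_temperature."""
    for i in range(attempts):
        # Assumed linear schedule: 0, 0.125, 0.25, 0.375, 0.5 with the defaults.
        temperature = max_temperature * i / max(attempts - 1, 1)
        solution = generate(model, prompt, temperature)
        if passes_tests(solution):
            return solution
    return None


def escalation(generate: Generate, passes_tests: PassesTests,
               models: Sequence[str], prompt: str) -> Optional[str]:
    """Escalation: try models from cheapest to most expensive, stop at the first pass."""
    for model in models:
        solution = generate(model, prompt, 0.0)  # temperature 0 is an assumption
        if passes_tests(solution):
            return solution
    return None
```

Each function returns the first passing solution, or `None` if every attempt fails, so the three can be chained or swapped without changing the caller.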

2 comments

vok 781 days ago

These strategies seem immediately practical. If you want to go beyond zero-shot for LLM coding, you may not need a complicated agent architecture - just start with escalation, retry, and warming.

smaddox 781 days ago

> Retrying makes sense because LLMs aren’t deterministic even at temperature zero.

This is news to me. I'm trying to think where non-determinism would come in at temperature zero, but coming up with nothing. What am I missing?

wongarsu 781 days ago

It can happen for a number of reasons, but in the case of GPT-4 it's probably because of their MoE implementation:

https://152334h.github.io/blog/non-determinism-in-gpt-4/

nicklecompte 781 days ago

It's because floating-point arithmetic isn't deterministic, which becomes salient when (speaking loosely) the difference between the likelihoods of two different tokens is less than the precision of the FPU.
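
As a toy illustration of that near-tie case, here is a minimal sketch in plain Python float64. The token names, the contribution values, and the helpers `accumulate` and `greedy_pick` are all hypothetical. It only shows that adding the same terms in a different order can change which token a temperature-zero (argmax) pick selects. It does not show where a real inference stack would reorder anything.

```python
def accumulate(terms):
    """Add terms left to right in float64, without compensated summation."""
    total = 0.0
    for term in terms:
        total += term
    return total


def greedy_pick(logits):
    """Temperature-zero decoding: return the token with the highest logit.

    On an exact tie, max() returns the first token in the dict.
    """
    return max(logits, key=logits.get)


# Hypothetical contributions to token "A"'s logit; token "B" is fixed at 0.6.
contributions = [0.1, 0.2, 0.3]

forward = {"B": 0.6, "A": accumulate(contributions)}
backward = {"B": 0.6, "A": accumulate(reversed(contributions))}

print(forward["A"], greedy_pick(forward))    # 0.6000000000000001 A
print(backward["A"], greedy_pick(backward))  # 0.6 B
```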

I am not sure to what extent this effect has been quantified.

chessgecko 781 days ago

Having played with this stuff, it's definitely spots in the expert buffers (the other comment in the thread links to the explanation) and not the extremely small differences in floating-point arithmetic. The floating-point effect is much, much smaller than any change in quantization, i.e. almost impossible to see in the outputs.

nicklecompte 781 days ago

I guess the root cause of my claim is that OpenAI won't tell us whether or not GPT-3.5 is an MoE model, and I assumed it wasn't. Since GPT-3.5 is clearly nondeterministic at temp=0, I believed the nondeterminism was due to FPU stuff, and this effect was amplified with GPT-4's MoE. But if GPT-3.5 is also MoE then that's just wrong.

What makes this especially tricky is that small models are truly 100% deterministic at temp=0 because the relative likelihoods are too coarse for FPU issues to be a factor. I had thought 3.5 was big enough that some of its token probabilities were too fine-grained for the FPU. But that's probably wrong.

On the other hand, it's not just GPT, there are currently floating-point difficulties in vllm which significantly affect the determinism of any model run on it: https://github.com/vllm-project/vllm/issues/966 Note that a suggested fix is upcasting to float32. So it's possible that GPT-3.5 is using an especially low-precision float and introducing nondeterminism by saving money on compute costs.
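
To make the precision point concrete, here is a toy sketch that does not reproduce the vllm issue. The logit values and the helper `round_to` are hypothetical. It rounds two nearby logits to float16 and to float32 using the standard `struct` module. In float16 they collapse into an exact tie, so something other than the logits ends up deciding the winner, while float32 keeps them apart.

```python
import struct


def round_to(fmt, x):
    """Round a Python float to a struct format's precision ('e' = float16, 'f' = float32)."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


# Hypothetical logits for two competing tokens, 0.001 apart.
logit_a, logit_b = 10.001, 10.002

f16 = (round_to("e", logit_a), round_to("e", logit_b))
f32 = (round_to("f", logit_a), round_to("f", logit_b))

print(f16[0] == f16[1])  # True: both round to 10.0 in float16
print(f32[0] < f32[1])   # True: float32 keeps them apart
```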

Sadly I do not have the money[1] to actually run a test to falsify any of this. It seems like this would be a good little research project.

[1] Or the time, or the motivation :) But this stuff is expensive.

memhole 781 days ago

I’m so glad to see LLMs spark these conversations lately. It’s been a huge gripe of mine that we don’t question the underlying precision in other areas of AI/ML.

wongarsu 781 days ago

The last couple of years have been a steady journey of us discovering that in most neural networks precision only matters in a couple key places, and everything else can get away with astonishingly little.

We started out training everything in full (f32) or double precision (f64). Then around 2020 everyone switched to half precision (f16) with some stuff in full precision. Now we are starting to move to quarter precision, and the newest Nvidia card even supports f4 (eighth precision?). And then of course there's the 1.58-bit LLM paper.

So there has been a steady stream of people questioning the underlying precision, and most of the time the answer they came back with was: there's more precision than we need, and a larger network with less precision is faster and better than a smaller network with more precision.

nicklecompte 781 days ago

To be clear there’s a distinction between the quality of the results and the determinism of the results. If a low-precision LLM is wildly stochastic but the variation is mostly linguistic rather than factual or deductive (e.g. coin tosses on synonyms or presenting independent facts in a different order), then there’s not really a contradiction.

AFAIK the determinism side of floating-point precision hasn’t been well-addressed, but it’s been a while since I skimmed those papers.

exe34 781 days ago

They can be made to be deterministic on CPU, but not on GPU (unless you want to give up on the speedup). With floating-point numbers, operations like addition are not associative: a + (b + c) is not always the same as (a + b) + c. So on CPU, you can make sure the order is always the same and the result is deterministic. On GPU, the order is not guaranteed, and thus the output is not deterministic.
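
A minimal sketch of the ordering effect, in plain Python float64. The values and the helpers `sequential_sum` and `pairwise_sum` are hypothetical, and the tree-shaped sum is only a loose stand-in for a parallel reduction, not a model of any particular GPU kernel. The same four numbers give different totals depending on how the additions are grouped.

```python
def sequential_sum(values):
    """Fixed left-to-right order, as a single-threaded loop would use."""
    total = values[0]
    for value in values[1:]:
        total += value
    return total


def pairwise_sum(values):
    """Tree-shaped reduction: sum each half, then combine the two partial sums."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return pairwise_sum(values[:mid]) + pairwise_sum(values[mid:])


# The exact mathematical sum of these values is 2.0.
values = [1e16, 1.0, -1e16, 1.0]

print(sequential_sum(values))                       # 1.0
print(pairwise_sum(values))                         # 0.0
print(sequential_sum([1e16, -1e16, 1.0, 1.0]))      # 2.0
```

Each grouping is itself repeatable: running the same function on the same list always gives the same number. The variation only appears when the grouping changes between runs.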

This is because of the
