| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by sunrunner 220 days ago
	Ignoring floating point errors, assuming a perfectly spherical cow, and taking air resistance as zero.

1 comments

AJRF 220 days ago

Imagine you are predicting the next token, you have two tokens very close in probability in the distribution, kernel execution is not deterministic because of floating point non-associativity - the token that gets predicted impacts the tokens later in the prediction stream - so it's very consequential which one gets picked.

This isn't some hypothetical - it happens all the time with LLM's - it isn't some freak accident that isn't probable

link

randomgermanguy 220 days ago

Okay yes, but would you really say that the main part of non-determinism in LLM-usage stems from this ? No its obviously the topk sampling.

I don't think my tech-lead was trying to suggest the floating-point error/non-associativity was the real source.

link

AJRF 220 days ago

> Would you really say that the main part of non-determinism in LLM-usage stems from this

Yes I would because it causes exponential divergence (P(correct) = (1-e)^n) and doesn't have a widely adopted solution. The major labs have very expensive researchers focused on this specific problem.

There is a paper from Thinking Machines from September around Batch Invariant kernels you should read, it's a good primer on this issue of non-determinism in LLM's, you might learn something from it!

Unfortunately the method has quite a lot of overhead, but promising research all the same.

link

randomgermanguy 220 days ago

Alright fair enough.

I dont think this is relevant to the main-point, but it's definitely something I wasn't aware of. I would've thought it might have an impact on like O(100)th token in some negligible way, but glad to learn.

link