| HN Mirror


	by scarmig 1056 days ago
	But why are there discrepancies in the floating point arithmetic? They have errors when approximating the reals, but floating point operations are all well-defined: even if 0.1 + 0.2 != 0.3, it's still always true that 0.1 + 0.2 == 0.1 + 0.2. I figure the issue must be something related to concurrency in a fleet of GPUs during inference, but even then it's not clear to me where the nondeterminism would creep in. Maybe different experts simultaneously work on an inference and the first to respond wins? Switching to models with different quantization depending on load?

imagainstit 1056 days ago

Floating point math is not associative: in general, (a + b) + c != a + (b + c), because each intermediate result is rounded.

This leads to different results when the same values are accumulated in different orders, and accumulating in different orders is common in parallel math operations.
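A minimal sketch of the non-associativity, in plain Python (IEEE 754 doubles) rather than GPU code. The values are just illustrative:

```python
# Three ordinary decimal values, none exactly representable in binary floating point.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # rounds after a + b, then again after adding c
right = a + (b + c)   # rounds after b + c, then again after adding a

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```

Each grouping is individually deterministic; the results only differ because the grouping differs.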

scarmig 1056 days ago

So I guess here my question is why a GPU would perform accumulations in a nondeterministic way where the non-associativity of FP arithmetic matters. You could require that a + b + c always be evaluated left to right and then you've got determinism, which all things being equal is desirable. Presumably because relaxing that constraint allows for some significant performance benefits, but how? Something like avoiding keeping a buffer of all the weights*activations before summing?

imagainstit 1055 days ago

Basically because it affects performance. You really don't want to write any buffers!

This is sort of a deep topic, so it's hard to give a concise answer, but as an example: cuBLAS guarantees determinism, but only for the same arch and the same library version (because the best-performing ordering of operations depends on arch and implementation details). It does not guarantee it when using multiple streams (because the thread scheduling is non-deterministic and can change the ordering).

Determinism is something you have to build in from the ground up if you want it. It can cost performance, it won't give you the same results between different architectures, and it's frequently tricky to maintain in the face of common parallel programming patterns.
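As a rough illustration of why the ordering changes, here is a CPU-only Python sketch that mimics a parallel reduction: each hypothetical "worker" sums its own chunk, then the partial sums are combined. The function names and the input values are made up for the example; this is not how any particular GPU library is implemented.

```python
def sequential_sum(xs):
    """Strict left-to-right accumulation."""
    total = 0.0
    for x in xs:
        total += x
    return total

def chunked_sum(xs, num_chunks):
    """Sum contiguous chunks independently, then combine the partial sums.

    This models splitting a reduction across workers: the grouping of the
    additions depends on how the input is partitioned.
    """
    size = -(-len(xs) // num_chunks)  # ceiling division
    partials = [sequential_sum(xs[i:i + size]) for i in range(0, len(xs), size)]
    return sequential_sum(partials)

# Values chosen so that rounding is visible: near 1e16, doubles are spaced 2 apart.
xs = [1e16, 1.0, -1e16, 1.0]

print(sequential_sum(xs))   # 1.0
print(chunked_sum(xs, 2))   # 0.0
```

Both functions are deterministic on their own. The result changes only when the partitioning (or the order in which partials are combined) changes, which is the part a scheduler or a different kernel choice can affect.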

Consider this explanation from the PyTorch docs (particularly the bit on CUDA convolutions):

https://pytorch.org/docs/stable/notes/randomness.html

SomewhatLikely 1056 days ago

There has been speculation that GPT-4 is a mixture-of-experts model, where each expert could be hosted on a different machine. Since those machines may report their results to the aggregating machine in different orders, the results could be summed in different orders.
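To make that concrete, here is a toy Python sketch of summing partial results in whatever order they arrive. It says nothing about how GPT-4 actually works; the three values and the helper name are hypothetical and chosen only to make the rounding visible.

```python
from itertools import permutations

# Hypothetical partial results reported by three machines.
partials = [1e16, 1.0, -1e16]

def sum_in_arrival_order(values):
    """Accumulate values one at a time, in the order given."""
    total = 0.0
    for v in values:
        total += v
    return total

# Try every possible arrival order and collect the distinct totals.
results = {sum_in_arrival_order(order) for order in permutations(partials)}
print(sorted(results))  # [0.0, 1.0]
```

The same three numbers produce two different totals depending purely on arrival order.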

swores 1055 days ago

Maybe my assumption of how MoE would/could work is wrong, but I had assumed that it means getting different models to generate different bits of text, and then stitching them together - for example, you ask it to write a short bit of code where every comment is poetry, the instruction would be split (by a top level "manager" model?) such that one model is given the task "write this code" and another given the task "write a poem that explains what the code does". There therefore wouldn't be maths done that's combining numbers from the different experts, just their outputs (text) being merged.

Have I completely misunderstood, does Mixture of Experts somehow involve the different experts actually collaborating on the raw computation together?

Could anyone share a recommendation for what to read to learn more about MoE generally? (Ideally that's understandable by someone like me that isn't an expert in LLMs/ML/etc.)

ossopite 1056 days ago

For performance reasons, yes. I believe it's because the accumulation is over parallel computations, so the ordering is at the mercy of the scheduler. But I'm not familiar with the precise details.

edit: at 13:42 in https://www.youtube.com/watch?v=TB07_mUMt0U&t=13m42s there is an explanation of the phenomenon in the context of training but I suspect the same kind of operation is happening during inference

charcircuit 1056 days ago

His point is that you do not have to rely on associativity holding in order to run inference on an LLM.
