by scarmig
1018 days ago
So I guess my question here is why a GPU would perform accumulations in a nondeterministic way where the non-associativity of FP arithmetic matters. You could require that a + b + c always be evaluated left to right and then you've got determinism, which all things being equal is desirable. Presumably relaxing that constraint allows for some significant performance benefits, but how? Something like avoiding keeping a buffer of all the weights*activations before summing?
|
This is sort of a deep topic, so it's hard to give a concise answer, but as an example: cuBLAS guarantees determinism, but only for the same arch and same library version (because the best-performing ordering of operations depends on arch and implementation details), and does not guarantee it when using multiple streams (because the thread scheduling is non-deterministic and can change ordering).
Determinism is something you have to build in from the ground up if you want it. It can cost performance, it won't give you the same results between different architectures, and it's frequently tricky to maintain in the face of common parallel programming patterns.
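To make the ordering point concrete, here's a rough Python sketch. It is plain CPU double-precision arithmetic, not GPU code, and the function names (`sum_left_to_right`, `sum_pairwise`) are made up for illustration. It sums the same four numbers strictly left to right and then as a pairwise tree, which is roughly the shape a parallel reduction takes:

```python
def sum_left_to_right(values):
    """Strict sequential order: ((v0 + v1) + v2) + ... (N dependent steps)."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_pairwise(values):
    """Tree-shaped reduction over a non-empty list.

    The two halves are independent, so parallel hardware could compute
    them at the same time; this sketch just recurses sequentially.
    """
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return sum_pairwise(values[:mid]) + sum_pairwise(values[mid:])


# Values chosen so rounding differs between the two orders.
values = [1e16, 1.0, -1e16, 1.0]

sequential = sum_left_to_right(values)
tree = sum_pairwise(values)

print("left to right:", sequential)
print("pairwise tree:", tree)
print("same result?  ", sequential == tree)
```

Each order is perfectly deterministic on its own; they just disagree with each other (and both miss the exact sum of 2). The left-to-right version is a chain of dependent additions, while the tree version can be spread across many threads, which is where the performance incentive comes from. If the shape of that tree depends on the arch, the library version, or how threads happen to get scheduled, the result can change with it.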
Consider this explanation from the PyTorch docs (particularly the bit on CUDA convolutions):
https://pytorch.org/docs/stable/notes/randomness.html