Hacker News new | ask | show | jobs
by jkaptur 155 days ago
(I'm not an expert. I'd love to be corrected by someone who actually knows.)

Floating-point arithmetic is not associative. (A+B)+C does not necessarily equal A+(B+C), but you can get a performance improvement by calculating A, B, and C in parallel, then adding together whichever two finish first. So, in theory, transformers can be deterministic, but in a real system they almost always aren't.

4 comments

Not an expert either, but my understanding is that large models use quantized weights and tensor inputs for inference. Multiplication and addition of fixed-point values is associative, so unless there's an intermediate "convert to/from IEEE float" step (activation functions, maybe?), you can still build determinism into a performant model.
Fixed point arithmetic isn't truly associative unless they have infinite precision. The second you hit a limit or saturate/clamp a value the result very much depends on order of operations.
Ah yes, I forgot about saturating arithmetic. But even for that, you wouldn't need infinite precision for all values, you'd only need "enough" precision for the intermediate values, right? E.g. for an inner product of two N-element vectors containing M-bit integers, an accumulator with at least ceil(log2(N))+2*M bits would guarantee no overflow.
True, you can increase bit width to guarantee never hit those issues, but right now saturating arithmetic on types that pretty commonly hit those values is the standard. Guaranteeing it would be a significant performance drop and/or memory use increase with current techniques to the level it would significantly affect availability and cost compared to what people expect.

Similarly you could not allow re-ordering of operations and similar - so the results are guaranteed to be deterministic (even if still "not correct" compared to infinite precision arithmetic) - but that would also have a big performance cost.

> you can get a performance improvement by calculating A, B, and C in parallel, then adding together whichever two finish first

Technically possible, but I think unlikely to happen in practice.

On the higher level, these large models are sequential and there’s nothing to parallelize. The inference is a continuous chain of data dependencies between temporary tensors which makes it impossible to compute different steps in parallel.

On the lower level, each step is a computationally expensive operation on a large tensor/matrix. These tensors are often millions of numbers, the problem is very parallelizable, and the tactics to do that efficiently are well researched because matrix linear algebra is in wide use for decades. However, it’s both complicated and slow to implement fine grained parallelism like “adding together whichever two finish first” on modern GPUs. Just too much synchronization, when total count of active threads is many thousands, too expensive. Instead, operations like matrix multiplications are often assigning 1 thread per output element or fixed count of output elements, and reduction like softmax or vector dot product are using a series of exponentially decreasing reduction steps, i.e. order is deterministic.

However, that order may change with even minor update of any parts of the software, including opaque pieces at the low level like GPU drivers and firmware. Library developers are updating GPU kernels, drivers, firmware and OS kernels collectively implementing scheduler which assigns work to cores, both may affect order of these arithmetic operations.

I don't think the order of operations is non-deterministic between different runs. That would make programming and researching these systems more difficult than necessary.
It would be if you used atomics.
I said: don't think it's non-deterministic, two negations -> deterministic.
It’s usually not too difficult or expensive to avoid doing this.