Hacker News new | ask | show | jobs
by cpgxiii 1042 days ago
Except the issue is inextricably linked to GPUs. All of the work in practical DNNs exists because of the extreme parallel performance available from GPUs, and that performance is only possible with non-deterministic threading. You can't get reasonable training and inference time on existing hardware without it.
2 comments

1000 threads can run in parallel. It doesn't prevent us to sum their results deterministically:

    results = ThreadPool(workers=1000).imap_unordered(calc, inputs)
    print(math.fsum(results))
Due to the magic of the fsum alg, the result is deterministic whatever order we get results in. https://docs.python.org/3/library/math.html#math.fsum
That's not the operation being performed on GPUs that is the problem. The issue is that fundamentally GPUs allow for high performance operations using atomics, but this comes at the cost of nondeterministic results. You can get deterministic results but doing so comes with a significant performance costs.
Using atomics is easier than warp operations (using warp shuffle for example), but warp shuffle is quite fast.

I guess if determinism is so important implementations can be changed, it is just maybe not that high priority.

That summation is slow and would not be used in practice.

You could use just one thread on your 10000 thread GPU too and it would be deterministic, sure. Completely beside the point.

In my experience cuBLAS is deterministic, since matmul is the most intensive part I don‘t see other reasons for non-determinism other than sloppyness (at least when just a single GPU is involved)
Yeah. In curated transformers [1] we are seeing completely deterministic output across multiple popular transformer architectures on a single GPU (there can be variance between GPUs due to different kernels). Of course, it completely depends on what ops and implementations you are using. But most transformers do not use ops that are typically non-deterministic to be fast (like scatter-add).

One non-determinism we see with a temperature of 0 is that once you have quantized weights, many predicted pieces will have the same probability, including multiple pieces with the highest probability. And then the sampler (if you are not using a greedy decoder) will sample from those pieces. So, generation is non-deterministic with a temperature of 0.

In other words, a temperature of 0 is a poor man’s greedy decoding. (It is totally possible that OpenAI’s implementation switches to a greedy decoder with a temperature of 0).

[1] https://github.com/explosion/curated-transformers