| HN Mirror


	by cpgxiii 1090 days ago
	Except the issue is inextricably linked to GPUs. All of the work in practical DNNs exists because of the extreme parallel performance available from GPUs, and that performance is only possible with non-deterministic threading. You can't get reasonable training and inference time on existing hardware without it.

d0mine 1090 days ago

1000 threads can run in parallel. That doesn't prevent us from summing their results deterministically:

    import math
    from multiprocessing.pool import ThreadPool
    results = ThreadPool(processes=1000).imap_unordered(calc, inputs)  # calc and inputs are defined elsewhere
    print(math.fsum(results))

Thanks to the fsum algorithm, which tracks exact partial sums and rounds only once at the end, the result is deterministic whatever order the results arrive in. https://docs.python.org/3/library/math.html#math.fsum
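A minimal sketch of that order-independence, assuming a typical IEEE-754 build where `math.fsum` returns the correctly rounded sum. `naive_sum` is a hypothetical helper that adds left to right, the way an unordered accumulation would. It is written as an explicit loop because the built-in `sum()` switched to a more accurate float algorithm in Python 3.12 and would hide the effect.

```python
import math
import random

def naive_sum(values):
    """Plain left-to-right float addition; the result depends on the order."""
    total = 0.0
    for v in values:
        total += v
    return total

# Same three numbers, two different arrival orders.
a = [1e16, 1.0, -1e16]
b = [1e16, -1e16, 1.0]

print(naive_sum(a), naive_sum(b))   # the two orders disagree
print(math.fsum(a), math.fsum(b))   # fsum agrees for both orders

# Shuffle a larger input many times and collect the distinct fsum results.
rng = random.Random(0)
data = [rng.uniform(-1e6, 1e6) for _ in range(1000)]
shuffled_sums = set()
for _ in range(20):
    rng.shuffle(data)
    shuffled_sums.add(math.fsum(data))
print(len(shuffled_sums))           # number of distinct fsum results
```

Whether this is fast enough for a given workload is a separate question; the sketch only shows that the result does not depend on arrival order.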

cpgxiii 1089 days ago

That's not the operation on GPUs that causes the problem. The fundamental issue is that GPUs allow high-performance operations using atomics, but this comes at the cost of nondeterministic results. You can get deterministic results, but doing so comes with a significant performance cost.

xiphias2 1089 days ago

Using atomics is easier than using warp-level operations (warp shuffle, for example), but warp shuffle is quite fast.

I guess if determinism is that important, implementations can be changed; it is just maybe not that high a priority.
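A rough CPU-side sketch of the difference, in Python rather than CUDA, so it only simulates the ordering behavior and says nothing about speed. `atomic_style_sum` and `warp_style_reduce` are hypothetical names: the first adds values in whatever order the threads happen to arrive (as an atomic add would), the second always combines the 32 lanes in the same fixed pairwise pattern that a shuffle-down reduction follows.

```python
WARP_SIZE = 32  # lanes per warp on current NVIDIA GPUs

def atomic_style_sum(values, arrival_order):
    """Accumulate in the order threads 'arrive'; the order is not fixed on a GPU."""
    total = 0.0
    for lane in arrival_order:
        total += values[lane]
    return total

def warp_style_reduce(values):
    """Fixed pairwise tree: lane i adds lane i+offset, for offset 16, 8, 4, 2, 1."""
    vals = list(values)
    offset = WARP_SIZE // 2
    while offset > 0:
        for lane in range(offset):
            vals[lane] += vals[lane + offset]
        offset //= 2
    return vals[0]  # lane 0 holds the final sum

values = [1e16, 1.0, -1e16] + [0.0] * (WARP_SIZE - 3)
order_a = list(range(WARP_SIZE))                  # lanes arrive 0, 1, 2, ...
order_b = [0, 2, 1] + list(range(3, WARP_SIZE))   # lanes 1 and 2 swap places

# Arrival order changes the atomic-style result.
print(atomic_style_sum(values, order_a), atomic_style_sum(values, order_b))
# The tree reduction has no arrival order, so repeated runs agree.
print(warp_style_reduce(values), warp_style_reduce(values))
```

The tree result is not necessarily more accurate than either atomic-style result; it is just the same every time for the same input.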

WithinReason 1089 days ago

That summation is slow and would not be used in practice.

You could use just one thread on your 10000-thread GPU too and it would be deterministic, sure. That's completely beside the point.

WanderPanda 1089 days ago

In my experience cuBLAS is deterministic. Since matmul is the most intensive part, I don't see reasons for non-determinism other than sloppiness (at least when just a single GPU is involved).

microtonal 1089 days ago

Yeah. In curated transformers [1] we are seeing completely deterministic output across multiple popular transformer architectures on a single GPU (there can be variance between GPUs due to different kernels). Of course, it completely depends on what ops and implementations you are using. But most transformers do not use the ops that typically give up determinism for speed (like scatter-add).

One source of non-determinism we do see with a temperature of 0 is that once you have quantized weights, many predicted pieces will have the same probability, including multiple pieces tied for the highest probability. The sampler (if you are not using a greedy decoder) will then sample from those tied pieces. So generation can be non-deterministic even with a temperature of 0.

In other words, a temperature of 0 is a poor man’s greedy decoding. (It is totally possible that OpenAI’s implementation switches to a greedy decoder with a temperature of 0).
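To make the tie case concrete, here is a toy sketch. `temperature_zero_sample` is a hypothetical sampler, not the code of curated-transformers or OpenAI; it assumes that at temperature 0 all probability mass lands on the highest-probability pieces and that ties among them are broken by random choice. The probabilities are made up to mimic ties caused by quantization.

```python
import random

def greedy_pick(probs):
    """Greedy decoding: index of the highest probability, first index wins ties."""
    return max(range(len(probs)), key=probs.__getitem__)

def temperature_zero_sample(probs, rng):
    """Hypothetical temperature-0 sampler: picks at random among the tied top pieces."""
    top = max(probs)
    tied = [i for i, p in enumerate(probs) if p == top]
    return rng.choice(tied)

# Made-up distribution where three pieces share the highest probability.
probs = [0.25, 0.25, 0.25, 0.125, 0.125]

rng = random.Random(0)
sampled = {temperature_zero_sample(probs, rng) for _ in range(200)}
greedy = {greedy_pick(probs) for _ in range(200)}

print(sorted(sampled))  # more than one piece shows up
print(sorted(greedy))   # always the same piece
```

A greedy decoder with a fixed tie-break rule avoids the problem, which is the distinction the comment draws between temperature 0 and true greedy decoding.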

[1] https://github.com/explosion/curated-transformers
