by Dylan16807
501 days ago
Each thread on a CPU will go in the same order. Why would the reduction step of a single neuron be split across multiple threads? That sounds slower and more complex than the naive method. And if you do decide to write code doing that, then just the code that reduces across multiple blocks needs to use integers, so pretty much no extra effort is needed. Like, is there a nondeterministic-dot-product instruction baked into the GPU at a low level?
Not unless you control the underlying scheduler and force a deterministic order; knowing all the code that is running isn't sufficient, because some factors affecting thread ordering are correlated with the physical environment. For example, minute differences in temperature gradients on the chip between two runs could affect how threads are allocated to CPU cores and the order in which they finish.
> Why would the reduction step of a single neuron be split across multiple threads?
It doesn't have to be, but it can be, depending on how many inputs the neuron has. Being able to assume that addition is associative and commutative (which floating-point addition only approximately is) gives you a lot of flexibility in how you parallelize it, and allows you to minimize overhead (both in throughput and memory requirements).
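To make the ordering point concrete, here is a minimal Python sketch (not anyone's production code). The helper `sum_in_order`, the sample values, and the fixed-point `SCALE` are all made up for illustration. It shows that summing the same floats in a different order can change the result, while summing scaled integers does not depend on order.

```python
def sum_in_order(values, order):
    """Accumulate values one at a time, visiting indices in the given order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total

# Hypothetical inputs chosen so that rounding is visible in float64.
values = [1e16, 1.0, -1e16, 1.0]

forward = sum_in_order(values, [0, 1, 2, 3])    # the 1.0 added to 1e16 is rounded away
reordered = sum_in_order(values, [0, 2, 1, 3])  # the large terms cancel first

# Fixed-point alternative (an assumption, not something from the thread):
# scale to integers, then accumulate. Integer addition is exactly
# associative, so any order gives the same sum.
SCALE = 2 ** 16
fixed = [round(v * SCALE) for v in values]
int_forward = sum(fixed[i] for i in [0, 1, 2, 3])
int_reordered = sum(fixed[i] for i in [0, 2, 1, 3])

print(forward, reordered)                 # the two float sums differ
print(int_forward == int_reordered)       # the two integer sums agree
```

On real hardware the integer accumulator has a fixed width, so this only works if the scale and width are chosen so the sum cannot overflow.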
> Like, is there a nondeterministic-dot-product instruction baked into the GPU at a low level?
No. There's just no dot-product instruction baked into the GPU at a low level that could handle vectors of arbitrary length. You need to write a loop, and that usually becomes some kind of parallel reduce.
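As a rough illustration of "a loop that becomes a parallel reduce", here is a sketch in plain Python. `dot_sequential` and `dot_pairwise` are hypothetical names, and the pairwise version only imitates the tree-shaped summation order that a GPU reduction might use; it does not run anything in parallel.

```python
def dot_sequential(a, b):
    """Naive loop: one accumulator, strictly left to right."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total

def dot_pairwise(a, b):
    """Tree-shaped reduce: add neighbours pairwise until one value remains."""
    partials = [x * y for x, y in zip(a, b)]
    if not partials:
        return 0.0
    while len(partials) > 1:
        if len(partials) % 2:
            partials.append(0.0)  # pad so every element has a partner
        partials = [partials[i] + partials[i + 1]
                    for i in range(0, len(partials), 2)]
    return partials[0]

# Hypothetical inputs; the exact dot product is 2.0.
a = [1e16, 1.0, -1e16, 1.0]
b = [1.0, 1.0, 1.0, 1.0]

print(dot_sequential(a, b), dot_pairwise(a, b))  # same inputs, different grouping
```

Both functions compute the same mathematical sum, but they group the additions differently, so the rounded float results need not match. If the grouping itself depends on scheduling, the result can vary from run to run.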