There is a delay from sending any data to/from the GPU.
Generally you aim to give the GPU a 'large' task (or many small tasks), then ask for the answer(s) in 1 batch and wait while it is transferred across.
If you have many small tasks where each answer is sent back separately, and you need that answer before requesting the next task, then you will be very slow, even if the data sent/received is small. A naive implementation of the chess algorithm here (with alpha-beta pruning (which has many if-then branches)) would be like this :/
Generally you aim to give the GPU a 'large' task (or many small tasks), then ask for the answer(s) in 1 batch and wait while it is transferred across.
If you have many small tasks where each answer is sent back separately, and you need that answer before requesting the next task, then you will be very slow, even if the data sent/received is small. A naive implementation of the chess algorithm here (with alpha-beta pruning (which has many if-then branches)) would be like this :/