One thing I don’t understand: if latency is important for this use case, why isn’t the CPU busy preparing the next GPU ‘job’ while a GPU ‘job’ is running?
I attempted to preempt your question in the section of my blog post, "Why don’t you just pipeline the GPU code so that it saturates the GPU?" It's one of the less-detailed sections though so maybe you have further questions? I think the main thing is that since Anukari processes input like MIDI and audio data in real-time, it can't work ahead of the CPU, because those inputs are not available yet.
Possibly what you describe is a bit more like double-buffering, which I also explored. The problem here is latency: any form of N-buffering introduces additional latency. This is one reason why some gamers don't like triple-buffering for graphics, because it introduces further latency between their mouse inputs and the visual change.
But furthermore, when the GPU clock rate is too low, double-buffering or pipelining don't help anyway, because fundamentally Anukari has to keep up with real time, and every block it processes is dependent on the previous one. With a fully-lowered GPU clock, the issue does actually become one of throughput and not just latency.
That's pipelining, and it's good for throughput, but it sacrifices latency. Audio is not a continuous bit stream but a series of small packets. Beginning work on the next one on the CPU while the previous one is on the GPU requires two packets in flight, which necessarily means higher latency.
I don’t see that. If the CPU part starts processing packet #2 while the GPU processes packet #1, not after it has done so, it will have the data that has to be sent to the GPU for packet #2 ready earlier, so it can send it earlier, potentially the moment the GPU has finished processing packet #1 (if the GPU is powerful enough, possibly even before that)
That’s why I asked about the plug-in APIs. They may have to be async, with functions not returning when they’re fully done processing a ‘packet’ but as soon as they can accept more data, which may be earlier.
But in general no, you can't begin processing a buffer before finishing the previous buffer because the processing is stateful and you would introduce a data race. And you can't synchronize the state with something simple like a lock, because locking the audio playback is forbidden in real time.
You can buffer ahead of time, but this introduces latency. You can't do things ahead of time without introducing delay, because of causality: you can't start processing packet #2 while packet #1 is in flight because packet #2 hasn't happened yet.
To make it a bit more clear why you can't do this without more latency:
Under the hood there is an audio device that reads/writes from a buffer at a fixed interval of time; call that N (a number of samples; divide by the sample rate to get seconds). When that interval is up, the driver swaps the buffer for a new one of the same size. The OS now has exactly N / sample_rate seconds to fill the buffer before it's swapped back with the device driver.
The kernel maps or copies the buffer into virtual memory, wakes the user space process, calls a function to fill the buffer, and returns to kernel space to commit it back to the driver. The buffer you read/write from your process is packet #1. Packet #2 doesn't arrive until the interval ticks again and buffers are exchanged.
Now say that processing packet #1 takes longer than N samples, or needs at least M samples of data to do its work, and M > N. What you do is copy your N samples of packet #1 into a temporary buffer, wait until M samples have been acquired to do your work, but concurrently read out of your internal buffer delayed by M - N samples. You've successfully done more work, but delayed the stream by the difference.
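To put a number on that deadline, here is a minimal sketch of the arithmetic. The function name `buffer_deadline_ms` and the example buffer sizes are my own illustrative choices, not anything from a specific driver or plug-in API.

```python
def buffer_deadline_ms(n_samples: int, sample_rate_hz: float) -> float:
    """Time available to fill one buffer, in milliseconds.

    A buffer of n_samples lasts n_samples / sample_rate_hz seconds,
    so that is also how long the process has before the next swap.
    """
    if n_samples <= 0 or sample_rate_hz <= 0:
        raise ValueError("n_samples and sample_rate_hz must be positive")
    return 1000.0 * n_samples / sample_rate_hz


# Hypothetical buffer sizes at a 48 kHz sample rate.
for n in (64, 128, 480):
    print(f"{n} samples -> {buffer_deadline_ms(n, 48_000):.3f} ms")
```

For example, 480 samples at 48 kHz leaves 10 ms per buffer. Smaller buffers shrink the deadline proportionally, which is why low-latency settings are so demanding.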
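A rough sketch of that internal buffer, assuming a plain FIFO pre-filled with M - N samples of silence. The class name `DelayedStream` is hypothetical, and this models only the latency cost, not whatever work is done on the M samples.

```python
from collections import deque


class DelayedStream:
    """Pass N-sample packets through a FIFO that delays them by M - N samples."""

    def __init__(self, n: int, m: int) -> None:
        if n <= 0 or m < n:
            raise ValueError("need 0 < n <= m")
        self.n = n
        # Priming with silence is what introduces the M - N sample delay.
        self.fifo = deque([0.0] * (m - n))

    def process(self, packet: list[float]) -> list[float]:
        if len(packet) != self.n:
            raise ValueError("packet must contain exactly n samples")
        self.fifo.extend(packet)  # newest samples go in at the back
        # The device still gets N samples every tick, just older ones.
        return [self.fifo.popleft() for _ in range(self.n)]


stream = DelayedStream(n=4, m=6)
print(stream.process([1.0, 2.0, 3.0, 4.0]))  # [0.0, 0.0, 1.0, 2.0]
print(stream.process([5.0, 6.0, 7.0, 8.0]))  # [3.0, 4.0, 5.0, 6.0]
```

The output is the input shifted by two samples (M - N), which is the extra latency being described.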
You're requiring that packet #2 be available before packet #1 has finished. That's higher latency than the goal, which is packet #1 is processed & sent to output before packet #2 has arrived at all.
Or perhaps you're missing that there's an input event as part of this, like a MIDI instrument? It's an in->effect->out sequence. So minimizing latency means that the "effect" part must be as small as possible, which means it's desired for it to happen faster than "in" can feed it data.
This might trick the heuristics in the right direction, i.e. feed the GPU a bunch of small tasks (with a small number of samples each) instead of big tasks.
I mean the CPU can't prepare a job for samples which don't exist yet. If it takes 0.5 milliseconds to process 1 millisecond's worth of audio, you'll necessarily be stopping and starting constantly. You can't keep the GPU fed continuously.
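Here is a toy timeline of that stop-and-start pattern, as a sketch under simplifying assumptions: each block only becomes available once its interval has fully elapsed, each block depends on the previous one finishing, and processing time is constant. The function name `simulate` and its parameters are made up for illustration.

```python
def simulate(block_ms: float, process_ms: float, num_blocks: int):
    """Return (idle_gaps, latencies) in ms for a strictly sequential processor."""
    idle_gaps = []   # GPU idle time before each block after the first
    latencies = []   # finish time minus the moment the block became available
    gpu_free_at = 0.0
    for k in range(num_blocks):
        available_at = (k + 1) * block_ms       # samples don't exist before this
        start = max(available_at, gpu_free_at)  # also wait for the previous block
        if k > 0:
            idle_gaps.append(start - gpu_free_at)
        gpu_free_at = start + process_ms
        latencies.append(gpu_free_at - available_at)
    return idle_gaps, latencies


# 0.5 ms of work per 1 ms of audio: the GPU idles half the time.
print(simulate(block_ms=1.0, process_ms=0.5, num_blocks=4))
# 1.5 ms of work per 1 ms of audio: never idle, but latency grows every block.
print(simulate(block_ms=1.0, process_ms=1.5, num_blocks=3))
```

In the first case the GPU sits idle for 0.5 ms between blocks and latency stays flat at 0.5 ms. In the second case (e.g. a downclocked GPU) there are no idle gaps, but each block finishes later than the last, so it has become a throughput problem.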