Hacker News new | ask | show | jobs
by DTolm 2090 days ago
I use the averaged data of 1000 merged launches and then average the end result over a number of runs. Merging FFT calls is actually the way how I use VkFFT in Vulkan Spirit (with some other shaders between), so this benchmark is fairly close to the real life application use case. My benchmark most likely averages out multimodal distribution effects by design.
2 comments

The OCT data we process comes in at about 4GSamples/s and my benchmark is for ~5ms of capture data, in the considered dataset 1D-FFT with a length of 2048 points and a block size of 128. It is not a synthetic benchmark, I'm measuring the real life application behavior here (and to eliminate the runtime behavior effects of the other parts I can flip a flag skipping over the DAQ codepath, working on allocated, but uninitialized buffers).
Small FFTs like 2048 only utilize one SM and the way they are given to the GPU may produce some fluctuations. It also depends on the way your code works. Synchronizations are also more impactful in this case. Do you launch a big grid that consists of multiple samples combined in a matrix or you launch each sample separately?
I'm aware of all of that. And yes, we're very synchronization dependent. However we also spent a lot of time tinkering with the launch parameter and properly interleaving all synchronization events and fences due to our demands on achieving low latency.

Find our original publication here: https://doi.org/10.1364/BOE.5.002963

Since then we improved on that. For the resampling and complex tonemapping we determined empirically that a grid of 128 threads, each processing a whole line achieves the best throughput; there's a 2D parameter space of possible launch configurations and we brute force the whole thing (so far I didn't benchmark the RTX20xx and RTX30xx GPUs, but it was consistent between the GTX690 to GTX1080). The FFT plan is what cufftPlan1d is producing for a single axis transform over a 2D array, usually 2048 point FFT, but with up to 4096 lines (well, technically whatever the maximum dimension for 3D textures is).

> Do you launch a big grid that consists of multiple samples combined in a matrix

Of course!

> or you launch each sample separately?

Of course not, that'd be stupid.

Well, most likely I won't be able to help explaining the fluctuation easily then, as you have spent a lot of time on it already. It would be cool to try VkFFT in this usage scenario at some pont in the future though - it also can do 1D FFTs of grouped in matrix sequences.
As I already mentioned over at https://www.reddit.com/r/vulkan/comments/i2ivzh/new_vulkan_f... I'm going to do that. And will let you know how it goes.
If you happen to need any assistance in refining VkFFT for your use case, feel free to contact me.
Please check out cuFFTDx - you may be able to fuse parts of your pipeline on-chip.
If it's multimodal, then averaging it out is the wrong thing to do. A histogram would be more appropriate to display the different modes.