Hacker News
by datenwolf 2090 days ago
> GPU is a very consistent device.

You'd think that, but I found that all the GPUs I'm using here exhibit a multimodal distribution of execution times in the FFT (this is for the cuFFT codepath). The GTX980 (not shown in the plot) and the Titan-X even have very prominent outliers. This is a figure that's going to be in the paper I'm currently writing:

https://dl.datenwolf.net/gpu_oct_benchmark_plots.pdf

I'm comparing the OCT processing execution times (with HOT caches, mind you) between a Titan-X and a GTX1080. The difference also shows up very prominently when looking at the kernel scheduling order as reported by NVVP.


I use the averaged data of 1000 merged launches and then average the end result over a number of runs. Merging FFT calls is actually how I use VkFFT in Vulkan Spirit (with some other shaders in between), so this benchmark is fairly close to the real-life application use case. By design, my benchmark most likely averages out the effects of a multimodal distribution.
The OCT data we process comes in at about 4 GSamples/s and my benchmark covers ~5 ms of capture data; for the dataset considered, that means 1D FFTs with a length of 2048 points and a block size of 128. It is not a synthetic benchmark: I'm measuring the real-life application behavior here (and to eliminate the runtime effects of the other parts I can flip a flag that skips the DAQ codepath and works on allocated but uninitialized buffers).
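For a sense of scale, the figures quoted above can be turned into a rough line count. This is only back-of-the-envelope arithmetic: it assumes the rate is exactly 4 GSamples/s, the capture is exactly 5 ms, and every sample becomes one input point of a 2048-point FFT. The variable names are made up for illustration.

```python
# Rough sizing from the numbers quoted in the comment above.
# Assumptions (not from the thread): exact 4 GS/s, exact 5 ms,
# one sample per FFT input point.
SAMPLE_RATE_HZ = 4_000_000_000   # ~4 GSamples/s
CAPTURE_US = 5_000               # ~5 ms, expressed in microseconds
FFT_LENGTH = 2048                # points per 1D FFT

# Integer arithmetic avoids floating-point rounding noise.
total_samples = SAMPLE_RATE_HZ * CAPTURE_US // 1_000_000
full_lines = total_samples // FFT_LENGTH

print(f"samples per capture window: {total_samples}")   # 20000000
print(f"complete 2048-point lines:  {full_lines}")      # 9765
```

So one capture window is on the order of ten thousand 2048-point transforms, which is the batch the benchmark is timing.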
Small FFTs like 2048 only utilize one SM and the way they are given to the GPU may produce some fluctuations. It also depends on the way your code works. Synchronizations are also more impactful in this case. Do you launch a big grid that consists of multiple samples combined in a matrix or you launch each sample separately?
I'm aware of all of that. And yes, we're very synchronization dependent. However, we also spent a lot of time tinkering with the launch parameters and properly interleaving all synchronization events and fences, because of our low-latency requirements.

Find our original publication here: https://doi.org/10.1364/BOE.5.002963

Since then we've improved on that. For the resampling and complex tonemapping we determined empirically that a grid of 128 threads, each processing a whole line, achieves the best throughput; there's a 2D parameter space of possible launch configurations and we brute-force the whole thing (so far I haven't benchmarked the RTX20xx and RTX30xx GPUs, but it was consistent from the GTX690 to the GTX1080). The FFT plan is what cufftPlan1d produces for a single-axis transform over a 2D array, usually a 2048-point FFT, but with up to 4096 lines (well, technically whatever the maximum dimension for 3D textures is).
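A brute-force sweep over a 2D launch-configuration space can be sketched roughly as below. The thread does not say what the two axes are, so `threads` and `lines_per_thread` are hypothetical names, and `fake_measure` is an arbitrary stand-in cost model, not GPU data; its minimum is placed at 128 threads only to echo the figure reported above. A real version would launch and time the kernel instead.

```python
import itertools

def sweep(measure, threads_options, lines_options):
    """Time every (threads, lines_per_thread) pair and return the fastest one."""
    results = {}
    for threads, lines in itertools.product(threads_options, lines_options):
        results[(threads, lines)] = measure(threads, lines)
    best = min(results, key=results.get)
    return best, results

def fake_measure(threads, lines_per_thread):
    # Hypothetical stand-in, NOT real timings: replace with code that
    # launches the kernel with this configuration and returns elapsed time.
    return 1.0 + 0.01 * abs(threads - 128) + 0.05 * abs(lines_per_thread - 1)

best, results = sweep(fake_measure, [32, 64, 128, 256, 512], [1, 2, 4])
print("fastest configuration:", best)
```

Since the timings discussed in this thread are multimodal, a real `measure` would probably want to return something more robust than a single run, such as the median of several repetitions.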

> Do you launch a big grid that consists of multiple samples combined in a matrix

Of course!

> or you launch each sample separately?

Of course not, that'd be stupid.

Well, most likely I won't be able to help explain the fluctuation easily then, as you have spent a lot of time on it already. It would be cool to try VkFFT in this usage scenario at some point in the future though: it can also do 1D FFTs of sequences grouped in a matrix.
As I already mentioned over at https://www.reddit.com/r/vulkan/comments/i2ivzh/new_vulkan_f... I'm going to do that, and I'll let you know how it goes.
Please check out cuFFTDx - you may be able to fuse parts of your pipeline on-chip.
If it's multimodal, then averaging it out is the wrong thing to do. A histogram would be more appropriate to display the different modes.
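To illustrate the point, here is a small sketch with synthetic data. The two clusters at 100 and 140 (arbitrary "microseconds") are invented for the example and are not measurements from any GPU. The mean lands in a region where essentially no launch ever finishes, while a plain text histogram shows both modes.

```python
import random
import statistics
from collections import Counter

random.seed(0)
# Synthetic bimodal "execution times": NOT measured data.
samples = [random.gauss(100, 2) for _ in range(7000)] + \
          [random.gauss(140, 2) for _ in range(3000)]

mean = statistics.fmean(samples)

# Bucket each sample by the lower edge of its 5-unit-wide bin.
BIN_WIDTH = 5
bins = Counter(int(t // BIN_WIDTH) * BIN_WIDTH for t in samples)

# Walk every bin between the extremes so empty bins are shown too.
for start in range(min(bins), max(bins) + BIN_WIDTH, BIN_WIDTH):
    bar = "#" * (bins[start] // 100)
    print(f"{start:4d}-{start + BIN_WIDTH:<4d} {bins[start]:5d} {bar}")

print(f"mean = {mean:.1f} (falls in the gap between the two modes)")
```

Reporting only the mean here would describe a timing that almost never occurs; the histogram makes the two populations and their relative weights visible.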