| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by DTolm 2092 days ago
	GPU is a very consistent device, so the purpose of such big sample sizes and multiple launches with averaging is to reduce all the deviations almost to zero. The error is <1% in this case and showing it on the plot will not really change it. The values, however, change when I update the code and improve it, so this is by no means the final way the benchmark will look like. I will think on how to adress this better in the future, but for now I think the best solution if you doubt the results is to launch VkFFT and see what it outputs for yourself.

1 comments

datenwolf 2092 days ago

> GPU is a very consistent device.

You'd think that, but I found all GPUs I'm using here to exhibit multimodal distribution of execution times in the FFT (this is for the cuFFT codepath). The GTX980 (not shown in the plot) and the Titan-X even have very prominent outliers. This is a figure that's going to be in the paper I'm currently writing:

https://dl.datenwolf.net/gpu_oct_benchmark_plots.pdf

I'm comparing the OCT processing execution times (with HOT caches, mind you) between a Titan-X and a GTX1080. The difference also shows up very prominently when looking at the kernel scheduling order as reported by NVPP.

link

DTolm 2092 days ago

I use the averaged data of 1000 merged launches and then average the end result over a number of runs. Merging FFT calls is actually the way how I use VkFFT in Vulkan Spirit (with some other shaders between), so this benchmark is fairly close to the real life application use case. My benchmark most likely averages out multimodal distribution effects by design.

link

datenwolf 2092 days ago

The OCT data we process comes in at about 4GSamples/s and my benchmark is for ~5ms of capture data, in the considered dataset 1D-FFT with a length of 2048 points and a block size of 128. It is not a synthetic benchmark, I'm measuring the real life application behavior here (and to eliminate the runtime behavior effects of the other parts I can flip a flag skipping over the DAQ codepath, working on allocated, but uninitialized buffers).

link

DTolm 2092 days ago

Small FFTs like 2048 only utilize one SM and the way they are given to the GPU may produce some fluctuations. It also depends on the way your code works. Synchronizations are also more impactful in this case. Do you launch a big grid that consists of multiple samples combined in a matrix or you launch each sample separately?

link

datenwolf 2092 days ago

I'm aware of all of that. And yes, we're very synchronization dependent. However we also spent a lot of time tinkering with the launch parameter and properly interleaving all synchronization events and fences due to our demands on achieving low latency.

Find our original publication here: https://doi.org/10.1364/BOE.5.002963

Since then we improved on that. For the resampling and complex tonemapping we determined empirically that a grid of 128 threads, each processing a whole line achieves the best throughput; there's a 2D parameter space of possible launch configurations and we brute force the whole thing (so far I didn't benchmark the RTX20xx and RTX30xx GPUs, but it was consistent between the GTX690 to GTX1080). The FFT plan is what cufftPlan1d is producing for a single axis transform over a 2D array, usually 2048 point FFT, but with up to 4096 lines (well, technically whatever the maximum dimension for 3D textures is).

> Do you launch a big grid that consists of multiple samples combined in a matrix

Of course!

> or you launch each sample separately?

Of course not, that'd be stupid.

link

DTolm 2092 days ago

Well, most likely I won't be able to help explaining the fluctuation easily then, as you have spent a lot of time on it already. It would be cool to try VkFFT in this usage scenario at some pont in the future though - it also can do 1D FFTs of grouped in matrix sequences.

link

llukas 2092 days ago

Please check out cuFFTDx - you may be able to fuse parts of your pipeline on-chip.

link

eximius 2092 days ago

If it's multimodal, then averaging it out is the wrong thing to do. A histogram would be more appropriate to display the different modes.

link