by DTolm 2090 days ago
The FFT and iFFT are performed consecutively up to 1000 times, and then each run is repeated 5 more times. The total result is averaged for both VkFFT and cuFFT and stays roughly the same between launches. The minor performance gains (5-20%) are noticeable. If you have a better testing technique, I am open to suggestions.
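
For readers who want to reproduce this kind of measurement, here is a minimal sketch of the scheme described above: run a batch of consecutive calls, divide by the batch size, and repeat a few times. It is not VkFFT's actual benchmark code. `time_batched` and `dummy_workload` are hypothetical names, and the workload is a CPU placeholder standing in for one FFT + iFFT pair.

```python
import time
from statistics import mean


def time_batched(workload, batch=1000, repeats=5):
    """Return the average seconds per call for each of `repeats` batches."""
    per_call = []
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(batch):
            workload()
        # Assumption: on a GPU, a device synchronization would be needed
        # here before reading the clock. This CPU stand-in needs none.
        elapsed = time.perf_counter() - start
        per_call.append(elapsed / batch)
    return per_call


def dummy_workload():
    # Hypothetical placeholder for one FFT + iFFT pair.
    sum(i * i for i in range(200))


if __name__ == "__main__":
    samples = time_batched(dummy_workload)
    print(f"mean per call: {mean(samples) * 1e6:.2f} us over {len(samples)} runs")
```

Keeping the per-run values in a list, rather than only the grand average, leaves the spread between runs available for later reporting.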

I have tested VkFFT on an Intel UHD620 GPU and the performance scaled at the same rate as most benchmarks do. There are a couple of parameters that can be modified for different GPUs (like the amount of memory coalesced, which is 32 bits on Nvidia GPUs after Pascal and 64 bits for Intel). I have no access to an AMD machine, otherwise I would have refined the launch configuration parameters for it too. I have not tested libraries other than cuFFT yet.


Thanks for the further clarification! If you ran this several times, you could calculate standard deviations or confidence intervals. It would be nice if you could report one such measure, so it's clearer that the differences are not just random fluctuations. E.g. you could include them as error bars in your plots. You could also run a statistical test (in this case, a t-test is very easy to do) and report the p-value. Those are the things I'd expect my students to do if they had to do something like this for a report or a project, because it's the only way for people to judge whether differences show a clear signal or are just random fluctuations due to measurement noise.

Also: I should've said this in my first post already, which in hindsight might sound too negative: I think this is a cool project and you did a great job! I just thought this might improve the presentation of your results a bit.
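
To make the suggestion in this comment concrete, here is a minimal sketch of the summary statistics and a Welch t-test using only the Python standard library. The function names are hypothetical and the timing values are invented placeholders, not measurements from VkFFT or cuFFT. The p-value itself needs the t distribution, which the standard library lacks; if SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same Welch test and returns it.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(samples):
    """Mean, sample standard deviation, and standard error of the mean."""
    m = mean(samples)
    s = stdev(samples)
    return m, s, s / sqrt(len(samples))


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    ma, sa, _ = summarize(a)
    mb, sb, _ = summarize(b)
    va, vb = sa**2 / len(a), sb**2 / len(b)
    t = (ma - mb) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
    return t, df


if __name__ == "__main__":
    # Invented placeholder timings in milliseconds, one value per run.
    lib_a = [1.02, 0.98, 1.01, 0.99, 1.00]
    lib_b = [1.10, 1.12, 1.09, 1.11, 1.08]
    for name, runs in (("lib_a", lib_a), ("lib_b", lib_b)):
        m, s, sem = summarize(runs)
        print(f"{name}: mean={m:.3f} ms, std={s:.3f} ms, sem={sem:.3f} ms")
    t, df = welch_t(lib_a, lib_b)
    print(f"Welch t={t:.2f}, df={df:.1f}")
```

The standard error is what would typically go on the plot as an error bar; the t statistic and degrees of freedom are the inputs to the p-value lookup.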

GPU is a very consistent device, so the purpose of such big sample sizes and multiple launches with averaging is to reduce all the deviations almost to zero. The error is <1% in this case, and showing it on the plot will not really change anything. The values, however, change when I update the code and improve it, so this is by no means the final form of the benchmark. I will think about how to address this better in the future, but for now I think the best solution, if you doubt the results, is to launch VkFFT and see what it outputs for yourself.
> GPU is a very consistent device.

You'd think that, but I found that all GPUs I'm using here exhibit a multimodal distribution of execution times in the FFT (this is for the cuFFT codepath). The GTX980 (not shown in the plot) and the Titan-X even have very prominent outliers. This is a figure that's going to be in the paper I'm currently writing:

https://dl.datenwolf.net/gpu_oct_benchmark_plots.pdf

I'm comparing the OCT processing execution times (with HOT caches, mind you) between a Titan-X and a GTX1080. The difference also shows up very prominently when looking at the kernel scheduling order as reported by NVVP.

I use the averaged data of 1000 merged launches and then average the end result over a number of runs. Merging FFT calls is actually how I use VkFFT in Vulkan Spirit (with some other shaders in between), so this benchmark is fairly close to the real-life application use case. My benchmark most likely averages out multimodal distribution effects by design.
The OCT data we process comes in at about 4 GSamples/s and my benchmark is for ~5 ms of capture data; in the dataset considered, that is a 1D FFT with a length of 2048 points and a block size of 128. It is not a synthetic benchmark; I'm measuring the real-life application behavior here (and to eliminate the runtime behavior effects of the other parts I can flip a flag that skips over the DAQ codepath, working on allocated but uninitialized buffers).
Small FFTs like 2048 only utilize one SM, and the way they are given to the GPU may produce some fluctuations. It also depends on the way your code works. Synchronizations are also more impactful in this case. Do you launch a big grid that consists of multiple samples combined in a matrix, or do you launch each sample separately?
Please check out cuFFTDx - you may be able to fuse parts of your pipeline on-chip.
If it's multimodal, then averaging it out is the wrong thing to do. A histogram would be more appropriate to display the different modes.
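
As an illustration of that last point, here is a minimal sketch of a text histogram over per-launch execution times, using only the standard library. The function names are hypothetical and the timings are invented integer microsecond values chosen to have two modes; they are not data from any of the GPUs discussed here.

```python
from collections import Counter


def histogram(samples, bin_width):
    """Count samples per bin; keys are the lower bin edges."""
    counts = Counter(s // bin_width for s in samples)
    return {k * bin_width: counts[k] for k in sorted(counts)}


def print_histogram(samples, bin_width):
    """Print one row of '#' marks per non-empty bin."""
    for edge, n in histogram(samples, bin_width).items():
        print(f"{edge:6d} us | {'#' * n}")


if __name__ == "__main__":
    # Invented placeholder timings in whole microseconds.
    timings = [100, 101, 103, 102, 104, 100, 151, 150, 153, 152, 199]
    print_histogram(timings, bin_width=10)
    print(f"mean = {sum(timings) / len(timings):.1f} us")
```

With data like this the mean falls between the two clusters, at a value that few individual launches actually take, which is why a single averaged number can hide the structure.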