by querez 2090 days ago
"VkFFT aims to provide community with an open-source alternative to Nvidia's cuFFT library, while achieving better performance."

There are no error bars on the graphs, so it's very hard to judge whether the minor differences are significant. I work in research, so I'm probably particular about this point, but I'd expect better from anyone who's taken basic statistics. From a quick look, though, the performance seems to be pretty much just "on par".

It would also be nice to know how it performs on other hardware. I'm assuming it's tuned to Nvidia GPUs (or maybe even the specific GPU mentioned). But how does this perform on Intel or AMD hardware? How does it compare to `rocFFT` or Intel's own implementation?


The FFT and iFFT are performed consecutively up to 1000 times and then each run is done 5 more times. The total result is averaged for both VkFFT and cuFFT and stays roughly the same between launches. The minor performance gains (5-20%) are noticeable. If you have a better testing technique, I am open to suggestions.
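
The measurement scheme described above (a batch of back-to-back transforms, repeated several times) could be sketched roughly as follows. This is a minimal illustration, not VkFFT's actual benchmark code: `run_batch`, `dummy_transform`, `batch_size`, and `repeats` are hypothetical names, and a real GPU benchmark would also have to wait for the device to finish before stopping the timer. The sketch keeps the per-repeat results instead of only their average, so that a spread can be reported afterwards.

```python
import time

def run_batch(transform, batch_size=1000, repeats=5):
    """Time `repeats` batches of `batch_size` consecutive calls.

    Returns one average per-call time (in seconds) for each repeat.
    """
    per_call_times = []
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(batch_size):
            transform()
        elapsed = time.perf_counter() - start
        per_call_times.append(elapsed / batch_size)
    return per_call_times

def dummy_transform():
    # Stand-in CPU workload (assumption); a real benchmark would launch
    # the FFT followed by the iFFT here.
    sum(i * i for i in range(100))

if __name__ == "__main__":
    times = run_batch(dummy_transform)
    print(["%.2e s" % t for t in times])
```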

I have tested VkFFT on an Intel UHD620 GPU and the performance scaled at the same rate as it does in most benchmarks. There are a couple of parameters that can be modified for different GPUs (like the amount of memory coalesced, which is 32 bits on Nvidia GPUs after Pascal and 64 bits for Intel). I have no access to an AMD machine, otherwise I would have refined the launch configuration parameters for it too. I have not yet tested any libraries other than cuFFT.

Thanks for the further clarification! If you ran this several times, you could calculate standard deviations or confidence intervals. It would be nice if you could report one such measure, so it's clearer that the differences are not just random fluctuations. E.g. you could include them as error bars in your plots. You could also run a statistical test (in this case, a t-test is very easy to do) and report the p-value. Those are the things I'd expect my students to do if they had to do something like this for a report or a project, because it's the only way for people to judge whether differences show a clear signal or are just random fluctuations due to measurement noise.
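
As a rough sketch of the suggestion above, the snippet below computes a mean, sample standard deviation, and standard error per library, plus Welch's t statistic for the difference. It uses only the Python standard library; the function names and the timing values are made up for illustration and are not measurements from VkFFT or cuFFT. Turning the t statistic into a p-value requires Student's t distribution, which the standard library does not provide; with SciPy installed, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and returns the p-value directly.

```python
import math
import statistics

def summarize(samples):
    """Return (mean, sample standard deviation, standard error of the mean)."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)  # sample (n - 1) standard deviation
    sem = stdev / math.sqrt(len(samples))
    return mean, stdev, sem

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    mean_a, _, sem_a = summarize(a)
    mean_b, _, sem_b = summarize(b)
    var_sum = sem_a ** 2 + sem_b ** 2
    t = (mean_a - mean_b) / math.sqrt(var_sum)
    df = var_sum ** 2 / (sem_a ** 4 / (len(a) - 1) + sem_b ** 4 / (len(b) - 1))
    return t, df

if __name__ == "__main__":
    # Hypothetical per-run timings in milliseconds (not real measurements).
    lib_a_ms = [10.1, 10.3, 10.2, 10.0, 10.4]
    lib_b_ms = [11.0, 11.2, 10.9, 11.1, 11.3]
    for name, data in (("A", lib_a_ms), ("B", lib_b_ms)):
        mean, stdev, sem = summarize(data)
        print(f"{name}: mean={mean:.2f} ms, stdev={stdev:.2f}, sem={sem:.2f}")
    t, df = welch_t(lib_a_ms, lib_b_ms)
    print(f"Welch t={t:.2f}, df={df:.1f}")
```

The standard error (or a confidence interval derived from it) is what would go on the plot as an error bar.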

Also: I should've said this in my first post already, which in hindsight might sound too negative: I think this is a cool project and you did a great job! I just thought this might improve the presentation of your results a bit.

A GPU is a very consistent device, so the purpose of such big sample sizes and multiple launches with averaging is to reduce all the deviations almost to zero. The error is <1% in this case, and showing it on the plot will not really change anything. The values, however, change when I update and improve the code, so this is by no means the final form of the benchmark. I will think about how to address this better in the future, but for now I think the best solution, if you doubt the results, is to launch VkFFT and see what it outputs for yourself.

> A GPU is a very consistent device.

You'd think that, but I found that all the GPUs I'm using here exhibit a multimodal distribution of execution times in the FFT (this is for the cuFFT codepath). The GTX980 (not shown in the plot) and the Titan-X even have very prominent outliers. This is a figure that's going to be in the paper I'm currently writing:

https://dl.datenwolf.net/gpu_oct_benchmark_plots.pdf

I'm comparing the OCT processing execution times (with HOT caches, mind you) between a Titan-X and a GTX1080. The difference also shows up very prominently when looking at the kernel scheduling order as reported by NVVP.

I use the averaged data of 1000 merged launches and then average the end result over a number of runs. Merging FFT calls is actually how I use VkFFT in Vulkan Spirit (with some other shaders in between), so this benchmark is fairly close to the real-life application use case. My benchmark most likely averages out multimodal distribution effects by design.

The OCT data we process comes in at about 4GSamples/s and my benchmark is for ~5ms of capture data; the dataset considered is a 1D FFT with a length of 2048 points and a block size of 128. It is not a synthetic benchmark; I'm measuring the real-life application behavior here (and to eliminate the runtime behavior effects of the other parts I can flip a flag that skips over the DAQ codepath, working on allocated but uninitialized buffers).

If it's multimodal, then averaging it out is the wrong thing to do. A histogram would be more appropriate to display the different modes.
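
To illustrate why averaging hides the modes, here is a small sketch that bins timings into a text histogram using only the Python standard library. The `histogram` and `print_histogram` helpers and the timing values are hypothetical (made-up bimodal data, not taken from any of the benchmarks discussed); the point is only that the mean can land in a bin that contains no samples at all.

```python
import math
import statistics
from collections import Counter

def histogram(samples, bin_width):
    """Count samples per bin; keys are bin indices, floor(x / bin_width)."""
    counts = Counter(math.floor(x / bin_width) for x in samples)
    return dict(sorted(counts.items()))

def print_histogram(samples, bin_width):
    """Print one text bar per non-empty bin."""
    for index, count in histogram(samples, bin_width).items():
        low = index * bin_width
        print(f"{low:6.2f} to {low + bin_width:6.2f} ms | {'#' * count}")

if __name__ == "__main__":
    # Made-up bimodal execution times in milliseconds (not real data).
    timings_ms = [1.0, 1.1, 1.05, 1.2, 1.15, 5.0, 5.2, 5.1]
    print_histogram(timings_ms, bin_width=1.0)
    print(f"mean = {statistics.mean(timings_ms):.2f} ms")
```

With data like this, the printed mean falls between the two clusters, so it describes an execution time that never actually occurs.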