by stagger87 2041 days ago
Thanks for the reply! I apologize; I could have looked up the PCIe bandwidth and answered part of my question myself, but it didn't occur to me.

Do you know of an application (other than 2-D/3-D transforms) where someone would move data from CPU to VRAM and then perform a whole bunch of 1-D FFTs on that memory? If any given complex value is part of only 1-2 FFTs, then PCIe bandwidth is the limiting factor.
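To put rough numbers on that claim, here is a minimal back-of-the-envelope sketch. The bandwidth figures (16 GB/s for PCIe, 500 GB/s for VRAM), the function names, and the assumption that one FFT costs one read plus one write of the data per pass are all illustrative, not measurements of any particular card.

```python
def transfer_to_compute_ratio(ffts_per_value, pcie_gbps=16.0, vram_gbps=500.0,
                              passes_per_fft=1, pcie_copies=1):
    """Ratio of PCIe copy time to on-GPU FFT time for the same data.

    A value above 1 means the host-to-device copy dominates.
    All default bandwidths are hypothetical round numbers.
    """
    # Time per byte spent crossing PCIe (pcie_copies=2 would add the copy back).
    transfer_time = pcie_copies / pcie_gbps
    # Each FFT pass is assumed to read and write every byte once in VRAM.
    compute_time = ffts_per_value * passes_per_fft * 2.0 / vram_gbps
    return transfer_time / compute_time


def break_even_ffts(pcie_gbps=16.0, vram_gbps=500.0,
                    passes_per_fft=1, pcie_copies=1):
    """FFTs per value at which GPU time equals PCIe transfer time."""
    return pcie_copies * vram_gbps / (2.0 * passes_per_fft * pcie_gbps)


if __name__ == "__main__":
    for count in (1, 2, 16, 64):
        ratio = transfer_to_compute_ratio(count)
        print(f"{count:3d} FFTs per value: transfer/compute = {ratio:.2f}")
    print(f"break-even at about {break_even_ffts():.1f} FFTs per value")
```

Under these assumed numbers the copy only stops dominating once each value takes part in more than a dozen or so FFTs, which is the intuition behind the question.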

Just as a curiosity, why don't FFT benchmarks (CPU/GPU/otherwise) ever simply plot FFTs per second on the y-axis? That's the number we all care about, right? It's always bandwidth (with some nuance, as you described) or, worse, MFLOPS with some arbitrary scaling factor. That's fine for relative performance, but if you're not comparing against my platform, it's not that useful unless I measure my platform and convert to the same representation.
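For what it's worth, the conversion between these representations is mechanical once the FFT size and element width are known. The sketch below assumes the commonly used nominal flop count of 5 N log2(N) for a complex transform and counts bandwidth as one read plus one write of the array; the function names are made up for illustration, and a given benchmark may use a different convention.

```python
import math


def fft_metrics(n, seconds_per_fft, bytes_per_value=8):
    """Express one timing three ways: FFTs/s, nominal MFLOPS, GB/s.

    bytes_per_value=8 assumes single-precision complex data.
    """
    ffts_per_second = 1.0 / seconds_per_fft
    # Nominal flop count 5*N*log2(N), reported in millions of flops per second.
    mflops = 5.0 * n * math.log2(n) / (seconds_per_fft * 1e6)
    # Assumed convention: the array is read once and written once.
    gb_per_second = 2.0 * n * bytes_per_value / seconds_per_fft / 1e9
    return {"ffts_per_s": ffts_per_second, "mflops": mflops, "gb_per_s": gb_per_second}


def ffts_per_second_from_mflops(n, mflops):
    """Invert the nominal-MFLOPS scaling back to FFTs per second."""
    return mflops * 1e6 / (5.0 * n * math.log2(n))


if __name__ == "__main__":
    m = fft_metrics(1024, 1e-6)
    print(m)
    print(ffts_per_second_from_mflops(1024, m["mflops"]))
```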

1 comments

Big 1D FFTs also take a lot of memory by themselves (e.g. a 2^28-point FFT takes 2 GB just to store the single-precision complex data). Multiple smaller batches can be used in ML applications, for example for big-kernel convolutions. All of the learning can actually be done without transferring data to the CPU. In computational physics, iterative processes or PDE integrators can run their algorithms independently of the CPU.
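The 2 GB figure is just size times element width, as the small sketch below shows. The helper names and the `buffers` parameter (for out-of-place transforms or extra scratch space) are illustrative assumptions, not part of any library.

```python
def fft_buffer_bytes(n, bytes_per_real=4):
    """Bytes needed to store n complex values (real + imaginary parts)."""
    return n * 2 * bytes_per_real


def largest_pow2_fft(vram_bytes, bytes_per_real=4, buffers=1):
    """Largest power-of-two FFT length whose buffers fit in vram_bytes."""
    max_values = vram_bytes // (2 * bytes_per_real * buffers)
    if max_values < 1:
        return 0
    # Round down to a power of two.
    return 1 << (max_values.bit_length() - 1)


if __name__ == "__main__":
    gib = 1024 ** 3
    print(fft_buffer_bytes(2 ** 28) / gib, "GiB in single precision")
    print(fft_buffer_bytes(2 ** 28, bytes_per_real=8) / gib, "GiB in double precision")
    print(largest_pow2_fft(8 * gib, buffers=2), "points fit in 8 GiB with two buffers")
```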

About using simple time per iteration as a benchmark: I used to have this type of benchmark before the last update (see: https://raw.githubusercontent.com/DTolm/VkFFT/f7c8c45717006c...). As you can see, it is not really that informative, as you can't really compare small times to big ones there. The layout I use now doesn't make any assumptions about the algorithm, yet still provides very informative scaling: just by looking at it, it is possible to judge whether techniques like register overutilization actually work. Another important thing is that it can clearly show that the problem is bandwidth bound and compare the sizes at which VkFFT/cuFFT switch from a 1-stage to a 2-stage and then to a 3-stage FFT algorithm. It also makes it possible to detect whether an algorithm deviates from the predicted result.
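As a rough illustration of why a bandwidth plot exposes the stage switches, here is a simplified model, not the actual VkFFT benchmark code. It assumes every stage reads and writes the whole array once at peak VRAM bandwidth, and the 500 GB/s peak and all function names are hypothetical.

```python
def predicted_time_s(n, stages, peak_gbps, bytes_per_value=8):
    """Predicted time for a bandwidth-bound FFT done in `stages` passes."""
    bytes_moved = stages * 2 * n * bytes_per_value  # read + write per stage
    return bytes_moved / (peak_gbps * 1e9)


def effective_bandwidth_gbps(n, measured_s, bytes_per_value=8):
    """Bandwidth score that counts a single read and write of the array."""
    return 2 * n * bytes_per_value / measured_s / 1e9


def implied_stages(n, measured_s, peak_gbps, bytes_per_value=8):
    """How many full passes over memory the measured time corresponds to."""
    return peak_gbps / effective_bandwidth_gbps(n, measured_s, bytes_per_value)


if __name__ == "__main__":
    n, peak = 2 ** 20, 500.0
    for stages in (1, 2, 3):
        t = predicted_time_s(n, stages, peak)
        print(stages, "stage(s):", f"{effective_bandwidth_gbps(n, t):.1f} GB/s")
```

In this model the score drops to one half and then one third of peak as the algorithm moves to 2 and 3 stages, so a measurement that falls well away from those levels hints that something other than memory bandwidth is limiting it.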