Hacker News new | ask | show | jobs
by stagger87 2041 days ago
Do I understand your benchmark plots correctly?

Using the single precision at 1k FFT size as my example.

~165,000 kB/ms performance

Converts to 165,000 MB/s performance

Divide by 8 to convert to complex samples, so 20,625 M complex samples per second.

Divide by 1k to get FFT count of ~20.14M FFT/IFFTs per second?

These benchmarks also include transfer time to and from the GPU?

1 comments

1k FFT size in single precision is 1024 x 2 x sizeof(float) = 8KB. If we don't think that it won't utilize full GPU (not even one compute unit) and assume that it scales similarly to big systems then: 1)165GB/s is an algorithmic bandwidth of benchmark, including consecutive FFT+iFFT. Both of them take one upload and one download from chip - total 4 memory transfers. The real bandwidth for this value will be 4*165=660GB/s. 2)one FFT is 2 transfers - upload and download. Total 16KB. 3)660GB/s / 16KB = 43M iterations per second. Similar to your number, but your number didn't account that benchmark has 4 uploads instead of 2.

These benchmarks don't include transfers to and from GPU, as those are done with PCI-E bandwidth (30GB/s) which is really slow compared to VRAM-chip bandwidth (>500GB/s). This is why it is important to have enough VRAM and avoid CPU communications as much as possible.

My son hit that PCIe limit going the other direction, trying to get FFTW data on to the screen. The project:

https://github.com/kevinacahalan/piano_waterfall

With his motherboard it was impossible to keep both FFT views scrolling at full speed if they were large. He ended up creating a circular buffer in video memory so that he would be able to reduce the PCIe traffic to just the fresh new edge of the data. The fix doesn't work everywhere. Virtualization seems to break it, including with a Chromebook.

Is VkFFT a reasonable tool for attacking this problem? How difficult might it be to get the FFT result into the needed color component, gamma corrected and scaled, with all the other components?

In VkFFT FFT is done as a part of the so called compute queue. There is no need to transfer data through PCI to show it on screen - there is a graphics queue that has this as its main purpose. You can use compute queue results in it and do all necessary calculations directly on the GPU. This type of task can benifit greatly from doing FFT on GPU. VkFFT can be append FFT to the user command buffer (list of commands), which can be then followed by additional user commands, like scaling, correcting, etc.

AFAIK, virtualization software right now has no or very limited access to the GPU, so no extensive graphics can be done through it.

Thanks for the reply! I apologize, I could have looked up the PCIE bandwidth and answered part of my question, it didn't occur to me.

Do you know an application (other than 2-D/3D transforms) where someone would move data from CPU->VRAM then perform a whole bunch of 1-D FFTs on that memory? If any given complex value is only part of only 1-2 FFTs, then PCI-E bandwidth is the limiting factor.

As just a curiosity, why don't FFT benchmarks (CPU/GPU/otherwise) ever simply plot FFTs per second on the y-axis? That's the number we all care about right? It's always bandwidth (with some nuance as you described) or worse, MFLOPs with some arbitrary scaling factor. Fine for relative performance, but if your not comparing it to my platform, its not that useful unless I measure my platform and convert to the same representation.

Big 1D FFTs also take a lot of memory by themselves (i.e. 2^28 takes 2GB just to store complex data). Multiple smaller batches can be used in ML applications for example for big kernel convolutions. All learning can actually be done without transferring data to CPU. In computational physics iterative processes or PDE integrators can do their algorithms independently of the CPU.

About using simple time per iteration as a benchmark. I used to have this type of benchmark before the last update (see: https://raw.githubusercontent.com/DTolm/VkFFT/f7c8c45717006c...). As you can see, it is not really that informative, as you can't really compare smaller times to big ones here. The layout I use now doesn't make any assumptions about algorithm yet still provides very informative scaling - just by looking at it it is possible to jusge wether techinques like register overutilization actually work. Another important thing is that it can clearly prove that the problem is bandwidth bound and compare at which size VkFFT/cuFFT swwitch from 1 to 2 and then to 3 stage FFT algorithm. It also allows to detect wether algorithm deviates from predicted result.