Hacker News new | ask | show | jobs
by DTolm 2042 days ago
Hello! Since the last post VkFFT has experienced a number of huge improvements and optimizations. Namely:

-It now supports sequences up to 2^32 in all dimensions (algorithmically, in reality limited to allocatable memory size, switch to 64-bit addressing scheme is planned for future release)

-configurations optimized for bigger range of systems and vendors

-benchmarked Radeon VII and RTX 3080, shows that FFT is extremely bandwidth limited on modern GPUs

-VkFFT is able to match and outperform cuFFT on the whole tested range from 2^7 to 2^28 in single precision

-added double and half precision support and precision tests against FFTW on CPU

-improved native zeropadding - up to 3x performance boost

-switched license to MPL 2.0

Thanks for your attention! I am happy to answer any questions.

5 comments

> VkFFT is able to match and outperform cuFFT on the whole tested range from 2^7 to 2^28 in single precision

What is your explanation for this?

Is the VkFFT algorithm better? Is SPIR-V fundamentally more expressive than PTX? Are nVidia drivers better at compiling SPIR-V than PTX?

Have you compared the generated GPU assembly from both?

FFT is an extremely bandwidth limited problem, so if most time is taken by one upload by both algorithms, the overall time will be similar. More in-depth analysis of how VkFFT and cuFFT scales with memory clocks and bandwidth can be found here: https://www.reddit.com/r/nvidia/comments/jxlbjs/rtx_3090_ove...

I don't know exactly what cuFFT does differently, but I am fairly certain they use very similar memory layout and algorithms behind their code (judging by execution times only).

What should be the main take from this is that Vulkan allows for similar in performance low-level memory control, while being cross platform and open source. I don't think that SPIR-V is more expressive - bet Nvidia wouldn't allow this. But it doesn't prohibit it from still being good.

Do I understand your benchmark plots correctly?

Using the single precision at 1k FFT size as my example.

~165,000 kB/ms performance

Converts to 165,000 MB/s performance

Divide by 8 to convert to complex samples, so 20,625 M complex samples per second.

Divide by 1k to get FFT count of ~20.14M FFT/IFFTs per second?

These benchmarks also include transfer time to and from the GPU?

1k FFT size in single precision is 1024 x 2 x sizeof(float) = 8KB. If we don't think that it won't utilize full GPU (not even one compute unit) and assume that it scales similarly to big systems then: 1)165GB/s is an algorithmic bandwidth of benchmark, including consecutive FFT+iFFT. Both of them take one upload and one download from chip - total 4 memory transfers. The real bandwidth for this value will be 4*165=660GB/s. 2)one FFT is 2 transfers - upload and download. Total 16KB. 3)660GB/s / 16KB = 43M iterations per second. Similar to your number, but your number didn't account that benchmark has 4 uploads instead of 2.

These benchmarks don't include transfers to and from GPU, as those are done with PCI-E bandwidth (30GB/s) which is really slow compared to VRAM-chip bandwidth (>500GB/s). This is why it is important to have enough VRAM and avoid CPU communications as much as possible.

My son hit that PCIe limit going the other direction, trying to get FFTW data on to the screen. The project:

https://github.com/kevinacahalan/piano_waterfall

With his motherboard it was impossible to keep both FFT views scrolling at full speed if they were large. He ended up creating a circular buffer in video memory so that he would be able to reduce the PCIe traffic to just the fresh new edge of the data. The fix doesn't work everywhere. Virtualization seems to break it, including with a Chromebook.

Is VkFFT a reasonable tool for attacking this problem? How difficult might it be to get the FFT result into the needed color component, gamma corrected and scaled, with all the other components?

In VkFFT FFT is done as a part of the so called compute queue. There is no need to transfer data through PCI to show it on screen - there is a graphics queue that has this as its main purpose. You can use compute queue results in it and do all necessary calculations directly on the GPU. This type of task can benifit greatly from doing FFT on GPU. VkFFT can be append FFT to the user command buffer (list of commands), which can be then followed by additional user commands, like scaling, correcting, etc.

AFAIK, virtualization software right now has no or very limited access to the GPU, so no extensive graphics can be done through it.

Thanks for the reply! I apologize, I could have looked up the PCIE bandwidth and answered part of my question, it didn't occur to me.

Do you know an application (other than 2-D/3D transforms) where someone would move data from CPU->VRAM then perform a whole bunch of 1-D FFTs on that memory? If any given complex value is only part of only 1-2 FFTs, then PCI-E bandwidth is the limiting factor.

As just a curiosity, why don't FFT benchmarks (CPU/GPU/otherwise) ever simply plot FFTs per second on the y-axis? That's the number we all care about right? It's always bandwidth (with some nuance as you described) or worse, MFLOPs with some arbitrary scaling factor. Fine for relative performance, but if your not comparing it to my platform, its not that useful unless I measure my platform and convert to the same representation.

Big 1D FFTs also take a lot of memory by themselves (i.e. 2^28 takes 2GB just to store complex data). Multiple smaller batches can be used in ML applications for example for big kernel convolutions. All learning can actually be done without transferring data to CPU. In computational physics iterative processes or PDE integrators can do their algorithms independently of the CPU.

About using simple time per iteration as a benchmark. I used to have this type of benchmark before the last update (see: https://raw.githubusercontent.com/DTolm/VkFFT/f7c8c45717006c...). As you can see, it is not really that informative, as you can't really compare smaller times to big ones here. The layout I use now doesn't make any assumptions about algorithm yet still provides very informative scaling - just by looking at it it is possible to jusge wether techinques like register overutilization actually work. Another important thing is that it can clearly prove that the problem is bandwidth bound and compare at which size VkFFT/cuFFT swwitch from 1 to 2 and then to 3 stage FFT algorithm. It also allows to detect wether algorithm deviates from predicted result.

> -benchmarked Radeon VII and RTX 3080, shows that FFT is extremely bandwidth limited on modern GPUs

Great to see that!

I expect huge improvements in that area with AMD's new RX series with SAM activated [0].

[0]: https://www.amd.com/en/technologies/smart-access-memory

Actually, it is still best to aim at zero transfers between GPU and CPU during the execution. The GPU is limited by VRAM-chip bandwidth which is much bigger than the PCI-E bandwidth. And it should not be affected by SAM.
Any plans for arbitrary-size transforms? (i.e., not restricted to vectors whose dimension is a power of two)
Yes, this is indeed something I would like to add in the future. While adding different radix kernels support for small prime factors is not that hard, writing efficient scheduler is a much more challenging task (each sequence, even for power of 2 now is split differently targeting different architectures to optimize performance).

The Bluestein's algorithm typically used for arbitrary prime sizes requires both zero-padding and convolutions support which are already efficiently implemented, so it is also not completely out of reach.

why did you choose MPL 2.0?
It is a great open-source license for library projects. For example, Eigen uses it: https://eigen.tuxfamily.org/index.php?title=News:Relicensing...!