by touisteur 549 days ago
Convolution by FFT overlap-save can have very low intermediate storage (none on GPU with cuFFTDx, for example). And most of the time the IFFT doesn't have to happen right away; lots of processing can still be performed in the frequency domain.
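To make the overlap-save idea concrete, here is a minimal CPU sketch in NumPy. It is not the cuFFTDx version; the function name `overlap_save`, the `nfft=128` default and the single-partition restriction (filter no longer than one block) are my assumptions for illustration. The point is that each block only needs one FFT-sized buffer, and the filter spectrum is computed once.

```python
import numpy as np

def overlap_save(x, h, nfft=128):
    """Full linear convolution of x with FIR filter h using overlap-save.

    Hypothetical helper for illustration: assumes len(h) <= nfft
    (no filter partitioning).
    """
    x = np.asarray(x)
    h = np.asarray(h)
    m = len(h)
    if m > nfft:
        raise ValueError("filter must fit in one FFT block in this sketch")
    step = nfft - (m - 1)            # valid output samples produced per block
    H = np.fft.fft(h, nfft)          # filter spectrum, computed once and reused
    n_out = len(x) + m - 1           # length of the full linear convolution
    n_blocks = -(-n_out // step)     # ceil(n_out / step)

    # m-1 zeros of "history" in front, zeros at the end to fill the last block
    padded = np.concatenate([
        np.zeros(m - 1, dtype=x.dtype),
        x,
        np.zeros(n_blocks * step - len(x), dtype=x.dtype),
    ])

    out = np.empty(n_blocks * step, dtype=complex)
    for k in range(n_blocks):
        seg = padded[k * step : k * step + nfft]   # overlaps previous block by m-1
        y = np.fft.ifft(np.fft.fft(seg) * H)       # circular convolution of the block
        out[k * step : (k + 1) * step] = y[m - 1:] # drop the m-1 aliased samples

    out = out[:n_out]
    if np.iscomplexobj(x) or np.iscomplexobj(h):
        return out
    return out.real
```

Anything that can be done on the spectrum (more filtering, channelizing, detection) could go between the `fft` and the `ifft` in the loop, which is where deferring the IFFT pays off.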

Having each of the 18k CUDA cores of an L40S perform small 128-point FFTs and manage long filters with very little sync or overlap... is pretty efficient by itself.

There's a lot happening in the HPC world on low-rank (what you're intuiting with PCA), sparse and tiled operations. I have a hard time applying all this to 'simple' signal processing, and most of it lacks nice APIs.

I've seen lots of interesting things with 'irregular FFT' codes and work on reducing the storage space needed for FFT intermediate results, sometimes through multi-resolution tricks.

Look up Capon filters and adaptive filtering in general; there's a whole world of tricks there too. You might need a whole lot of SVDs and matrix inversions there...
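For a feel of where the matrix inversions come from, here is a small NumPy sketch of the textbook Capon (MVDR) weights, w = R^-1 a / (a^H R^-1 a). The function name `capon_weights`, the snapshot layout and the diagonal-loading factor are my assumptions, not something from a specific library.

```python
import numpy as np

def capon_weights(snapshots, steering, loading=1e-3):
    """Capon / MVDR weights from array snapshots (hypothetical helper).

    snapshots: complex array of shape (n_sensors, n_snapshots)
    steering:  complex array of shape (n_sensors,), the look-direction vector a
    loading:   diagonal loading relative to the mean sensor power (assumption,
               added to keep the sample covariance well conditioned)
    """
    n_sensors, n_snap = snapshots.shape
    # Sample covariance matrix R = X X^H / N
    R = snapshots @ snapshots.conj().T / n_snap
    R = R + loading * (np.trace(R).real / n_sensors) * np.eye(n_sensors)
    # Solve R u = a rather than forming the inverse explicitly
    r_inv_a = np.linalg.solve(R, steering)
    # Normalize so the look direction passes undistorted: w^H a = 1
    return r_inv_a / np.vdot(steering, r_inv_a)

# Usage sketch: 8-element half-wavelength line array, look direction at 20 degrees
rng = np.random.default_rng(1)
n_sensors, n_snap = 8, 200
noise = rng.standard_normal((n_sensors, n_snap)) \
    + 1j * rng.standard_normal((n_sensors, n_snap))
a = np.exp(1j * np.pi * np.arange(n_sensors) * np.sin(np.deg2rad(20.0)))
w = capon_weights(noise, a)
```

One solve per look direction (and per frequency bin, if you do this on FFT outputs) adds up quickly, which is why batched solvers matter here.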

But mostly, if you're on a GPU, there's a wealth of parallelism to exploit and work around the 'memory-bound' limits of FFT-based convolution. This thesis https://theses.hal.science/tel-04542844 has some discussion and numbers on the topic. Not complete but inspiring.