i'm currently in the middle of a rabbit-hole exploration of being able to do in-place RADIX-2 FFT, DCT and DFT butterflys, the target is a general purpose function to cover each of those, in around 25 Vector instructions.
not 2,000 optimised loop-unrolled instructions specifically crafted for RADIX-8, another for RADIX-16, another for RADIX-32 ..... RADIX-4096 (as is the case in ffmpeg): 25 instructions FOR ANY 2^N FFT.
btw if you're interested in "real-world" SVP64 Vector Assembler we have the beginnings of an ffmpeg MP3 CODEC inner loop:
that's under 100 instructions, more than 4x less assembler for the same job in PPC64. and 6.5 times less assembler than ffmpeg's optimised x86 apply_window_float.S
you will no doubt be aware of the huge power savings that brings due to reduced L1 cache usage.
https://git.libre-soc.org/?p=openpower-isa.git;a=tree;f=src/...
i'm currently in the middle of a rabbit-hole exploration of being able to do in-place RADIX-2 FFT, DCT and DFT butterflys, the target is a general purpose function to cover each of those, in around 25 Vector instructions.
not 2,000 optimised loop-unrolled instructions specifically crafted for RADIX-8, another for RADIX-16, another for RADIX-32 ..... RADIX-4096 (as is the case in ffmpeg): 25 instructions FOR ANY 2^N FFT.
btw if you're interested in "real-world" SVP64 Vector Assembler we have the beginnings of an ffmpeg MP3 CODEC inner loop:
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=medi...
that's under 100 instructions, more than 4x less assembler for the same job in PPC64. and 6.5 times less assembler than ffmpeg's optimised x86 apply_window_float.S
you will no doubt be aware of the huge power savings that brings due to reduced L1 cache usage.