| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by touisteur 553 days ago
	You can mostly compute it for yourself or get an asymptotic feeling. Matmul is N^3 even with tensor cores eating a large part of it (with diminished precision, but ok) and FFT-based convolution is NlogN mostly. At some point - not that high - even the available TFLOPS available on tensor cores can't keep up. I was surprised by how far cuDNN will take you, but at some point the curves cross and matmul-based convolution time keeps growing at polynomial rate and FFT-based mostly linear (until it gets memory bound) and even that can be slashed down with block-based FFT schemes. Look up the thesis I posted earlier there's an introduction in it.

1 comments

almostgotcaught 553 days ago

> Matmul is N^3

are we talking about matmul or conv? because no one, today, is using matmul for conv (at least not gemm - igemm isn't the same thing).

like i said - the arms race has been going for ~10 years and everyone already discovered divide-conquer and twiddle factors a long time ago. if you do `nm -C libcudnn.so` you will see lots of mentions of winograd and it's not because the cudnn team likes grapes.

> Look up the thesis I posted earlier there's an introduction in it.

there are a billion theses on this. including mine. there are infinte tricks/approaches/methods for different platforms. that's why saying something like "fft is best" is silly.

link

touisteur 553 days ago

You ask how large and citation, and given one with sizes, you already have a billion ones. The article is about long convolutions (winograd being still a matmul implementation for square matrices and a pain to adapt to long convolutions and still N^2.3), and given comparisons with 10-years-optimized-(as you say) cuDNN and very naive barely running FFT-based long convolutions, showing actual O complexity matters at some point - even with the best hardware tricks and implementations.

I don't know what more to say ? Don't use FFT-based convolution for long convolutions if it doesn't work for your use cases or if you don't believe it would or should. And those of us who have benchmarked against SOTA direct convolution and found out FFT-based convolution worked better for our use cases will keep using it and talk about it when people ask on forums ?

link

almostgotcaught 552 days ago

> I don't know what more to say ?

Do you understand that you can't magic away the complexity bound on conv and matmul by simply taking the fft? I know the you do so given this very very obvious fact, there are only two options for how an fft approach could beat XYZ conventional kernel

1. The fft primitive you're using for your platform is more highly optimized due to sheer attention/effort/years prior to basically 2010. FFTW could fall into this category on some platforms.

2. The shape/data-layout you're operating on is particularly suited to the butterfly ops in cooley-tukey

That's it. There is no other possibility because again fft isn't some magical oracle for conv - it's literally just a linear mapping right?

So taking points 1 and 2 together you arrive at the implication: whatever you're doing/seeing/benching isn't general enough for anyone to care. I mean think about it: do you think the cudnn org doesn't know about cooley-tukey? Like just so happens they've completely slept on this method that's taught in like every single undergrad signals and systems class? So that must mean it's not a coincidence that fft doesn't rate as highly as you think it does. If you disagree just write your fftdnn library that revolutions conv perf for the whole world and collect your fame and fortune from every single FAANG that currently uses cudnn/cublas.

link