ELI5: You can turn this problem into finding the best "convolution index", and fourier transforms make computing convolutions cheaper.
ELIUndergrad:
(note that \* means multiplication, there doesnt seem to be a way to escape an asterisk)
Lets start by seeing how this is a convolution. We have the videoSpeech sequence, and the subtitle sequence - each is a vector, indexed by time, of 0's and 1's indicating whether there is speech in that time. We can imagine padding the sequences out on either side with 0's, and consider the alignment task as shifting the subtitle sequence left and right in time until we get the best alignment with the 1's in the videoSpeech sequence. We can express the goodness of alignment as the number of matching 1's, aka the sum over all times t of videoSpeech(t) \* subtitle(t). This is the definition of a convolution: the convolution of two sequences gives a new sequence where the value at index i is this sum above where one of the sequences is shifted by i. Mathematically, conv(videoSpeech, subtitle)(i) = sum( videoSpeech(t)\* subtitle(t-i)). So we can rephrase this problem as, find the index i which maximizes the value of the convolution sequence.
The discrete fourier transform is a function that takes a sequence and gives another sequence. It's relevant here because it "turns convolution into multiplication": fourier(videoSpeech)(i) \* fourier(subtitle)(i) = fourier(conv(videoSpeech, subtitle))(i).
So finally to solve the problem, we get the pointwise product sequence S = fourier(videoSpeech) \* fourier(subtitle), do the inverse fourier transform on it invFourier(S), and maximize invFourier(S)(i) over i.
> ELI5: You can turn this problem into finding the best "convolution index", and fourier transforms make computing convolutions cheaper.
I tried this sentence on my 5-year-old and got a blank stare. He then proceeded to tell me a story about how Darth Vader is so scary and cool and that he's actually Luke's father. YMMV.
ELI5 v2: you can look at the sequences a different way, like tilting your head to look at them from a different angle. and from that angle, we can more easily try every possible way of putting them together, so that we can choose the best way.
I'm writing a program which takes multiple channels of near-periodic audio wav files, and outputs a phase-stable oscilloscope video. This is done by FFT-correlating (a buffer of recent oscilloscope plots) with (the audio signal) (with quite a bit of added complexity for better results). Incidentally I'm also using Python/Numpy/ffmpeg.
It's most useful for "complex" chiptune like FM/SNES/tracker/MIDI music, which are easy to split into monophonic single-note channels. https://github.com/corrscope/corrscope Should I submit this separately (Show HN)?
Ok, use 'x' and write "(note that 'x' means multiplication)" instead of "(note that \* means multiplication, there doesnt seem to be a way to escape an asterisk)". More legible and shorter and 'symbol is already used for the purpose' and..
The problem being solved essentially is: you have two binary strings, and you want to offset one of them so that they match up the best. For each offset, you're taking a dot product of one sequence with the offset version of the other. This is the same as computing the convolution of the two sequences together (https://en.wikipedia.org/wiki/Convolution). Computing this naively would be O(n^2) (doing linear work for each possible offset).
One property of the Fourier transform is that convolution in the time domain corresponds to element-wise multiplication in the frequency domain (https://en.wikipedia.org/wiki/Convolution_theorem), so you can compute the convolution efficiently by taking the FFT of both series, doing element-wise multiplication, and then taking the inverse FFT of the result.
It has nothing to do directly with FFT, which is why it is confusing:
What he is looking for is maximal correlation between two binary series (think "Pearson's r or r-square correlation coefficient"). Now, correlation is just like convolution except one series is flipped around on the time axis. Which means, if you have an efficient way to compute convolutions (and you do, through FFT), you have an efficient way to compute correlations.
(I don't think 5 year olds heard of Pearson's correlation coefficient or convolutions, but ... that's the best I can do).
In many ways, correlation and covariance are more fundamental than convolution - they are closely and directly related to the inner product.
And the only reason to use the FFT here is convenience (i.e., it is fast and simple to compute), but the correlation/convolution property applies to many "transform domains" - Laplace, Z, Cosine, Sine, and a few others (all of which are closely related among themselves, but only a few easy to compute numerically).
ELIUndergrad: (note that \* means multiplication, there doesnt seem to be a way to escape an asterisk)
Lets start by seeing how this is a convolution. We have the videoSpeech sequence, and the subtitle sequence - each is a vector, indexed by time, of 0's and 1's indicating whether there is speech in that time. We can imagine padding the sequences out on either side with 0's, and consider the alignment task as shifting the subtitle sequence left and right in time until we get the best alignment with the 1's in the videoSpeech sequence. We can express the goodness of alignment as the number of matching 1's, aka the sum over all times t of videoSpeech(t) \* subtitle(t). This is the definition of a convolution: the convolution of two sequences gives a new sequence where the value at index i is this sum above where one of the sequences is shifted by i. Mathematically, conv(videoSpeech, subtitle)(i) = sum( videoSpeech(t)\* subtitle(t-i)). So we can rephrase this problem as, find the index i which maximizes the value of the convolution sequence.
The discrete fourier transform is a function that takes a sequence and gives another sequence. It's relevant here because it "turns convolution into multiplication": fourier(videoSpeech)(i) \* fourier(subtitle)(i) = fourier(conv(videoSpeech, subtitle))(i).
So finally to solve the problem, we get the pointwise product sequence S = fourier(videoSpeech) \* fourier(subtitle), do the inverse fourier transform on it invFourier(S), and maximize invFourier(S)(i) over i.