Hacker News new | ask | show | jobs
by robert-boehnke 2675 days ago
> Instead, we use the fact that "scoring all alignments" is a convolution operation and can be implemented with the Fast Fourier Transform (FFT), bringing the complexity down to O(n log n).

Not sure I understand this right - is this basically treating both binary strings as square waves, converting them to the frequency domain and determining the offset as a pitch shift between the two spectrograms?

2 comments

Precisely. If I could more easily get at the raw classifier output of webrtcvad, it should be possible to be even smarter (we could have square waves with any amplitude between -1 and +1, not just either -1 and +1, which should take into account the classifier uncertainty).

EDIT: err, I'm actually not sure about the pitch shift part, that's a bit of vocabulary I'm not familiar with. If you've seen the fast polynomial multiplication algo from CLRS, it's basically that. E.g. if we have strings 1101 and 0101, we can find the best alignment by looking at the exponent of the largest coefficient after multiplying

polynomial(1101)*polynomial(reverse(0101))

where polynomial(1101) = x^3 + x^2 - x + 1

and polynomial(reverse(0101)) = polynomial(1010) = x^3 - x^2 + x - 1

That is a very 'signals' based interpretation of the convolution theorem [1] applied to binary functions.

In essence, the fourier-transform is based on convolutions. F(eta) is essentially the convolution of f(x) with sin(eta x).* This is very loosely why the convolution theorem works.

[1] https://en.wikipedia.org/wiki/Convolution_theorem

* This excludes all cosine parts of the transform. Its neater to work in the complex domain and state that:

F(eta) = f(x) convolved with e^(eta i x)