Hacker News new | ask | show | jobs
by aabeshou 2671 days ago
ELI5: You can turn this problem into finding the best "convolution index", and fourier transforms make computing convolutions cheaper.

ELIUndergrad: (note that \* means multiplication, there doesnt seem to be a way to escape an asterisk)

Lets start by seeing how this is a convolution. We have the videoSpeech sequence, and the subtitle sequence - each is a vector, indexed by time, of 0's and 1's indicating whether there is speech in that time. We can imagine padding the sequences out on either side with 0's, and consider the alignment task as shifting the subtitle sequence left and right in time until we get the best alignment with the 1's in the videoSpeech sequence. We can express the goodness of alignment as the number of matching 1's, aka the sum over all times t of videoSpeech(t) \* subtitle(t). This is the definition of a convolution: the convolution of two sequences gives a new sequence where the value at index i is this sum above where one of the sequences is shifted by i. Mathematically, conv(videoSpeech, subtitle)(i) = sum( videoSpeech(t)\* subtitle(t-i)). So we can rephrase this problem as, find the index i which maximizes the value of the convolution sequence.

The discrete fourier transform is a function that takes a sequence and gives another sequence. It's relevant here because it "turns convolution into multiplication": fourier(videoSpeech)(i) \* fourier(subtitle)(i) = fourier(conv(videoSpeech, subtitle))(i).

So finally to solve the problem, we get the pointwise product sequence S = fourier(videoSpeech) \* fourier(subtitle), do the inverse fourier transform on it invFourier(S), and maximize invFourier(S)(i) over i.

3 comments

> ELI5: You can turn this problem into finding the best "convolution index", and fourier transforms make computing convolutions cheaper.

I tried this sentence on my 5-year-old and got a blank stare. He then proceeded to tell me a story about how Darth Vader is so scary and cool and that he's actually Luke's father. YMMV.

I guess GP actually meant ELWOT - Explain Like We're On Twitter.
lol, that makes sense. heres another try:

ELI5 v2: you can look at the sequences a different way, like tilting your head to look at them from a different angle. and from that angle, we can more easily try every possible way of putting them together, so that we can choose the best way.

I've beem working on a similar algorithm.

I'm writing a program which takes multiple channels of near-periodic audio wav files, and outputs a phase-stable oscilloscope video. This is done by FFT-correlating (a buffer of recent oscilloscope plots) with (the audio signal) (with quite a bit of added complexity for better results). Incidentally I'm also using Python/Numpy/ffmpeg.

It's most useful for "complex" chiptune like FM/SNES/tracker/MIDI music, which are easy to split into monophonic single-note channels. https://github.com/corrscope/corrscope Should I submit this separately (Show HN)?

Why not use good ol 'x' for multiplication? :-)
I keep AltGr+8 mapped to × for just such an emergency.
The Compose key works great for this (I use WinCompose on Windows). Most of the time I can guess the key combination. I think × is Compose, x, x.
'x' is a letter, not the times symbol.
But if you have to choose between '\*' and 'x', then 'x' might well be the best choice, no? :-)
If they had used ‘x’ instead, I would be scouring the comment for where the variable had been introduced.
Ok, use 'x' and write "(note that 'x' means multiplication)" instead of "(note that \* means multiplication, there doesnt seem to be a way to escape an asterisk)". More legible and shorter and 'symbol is already used for the purpose' and..
It is _a_ multiplication symbol. Why not use period ".".
Because there are a lot of those in the vicinity and it's very small?