by Joeboy 4336 days ago
The audio from the 60fps video sounds pretty bad though, which I suspect is mostly because of inherent maths/physics limitations rather than anything that software can improve.

Edit: They mention capturing frequencies up to five times higher than the 60Hz frame rate, which would mean a maximum frequency of 300Hz. By the Nyquist criterion that is equivalent to audio sampled at 0.6kHz, which is 1/73.5 of the 44.1kHz sample rate of a CD. I doubt you'd get intelligible speech from current consumer hardware using this technique.
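For anyone checking the arithmetic, here is a small sketch. The only figure not taken from the comment is the standard CD sample rate of 44.1kHz; the variable names are just for illustration.

```python
# Back-of-envelope check of the numbers in the comment above.
CD_SAMPLE_RATE_HZ = 44_100          # standard audio CD sample rate (assumed reference)

frame_rate_hz = 60                  # video frame rate from the comment
max_freq_hz = 5 * frame_rate_hz     # "five times higher than the frame rate"

# Nyquist: representing frequencies up to f needs a sample rate of at least 2*f.
equivalent_sample_rate_hz = 2 * max_freq_hz

ratio_to_cd = CD_SAMPLE_RATE_HZ / equivalent_sample_rate_hz

print(f"max frequency:          {max_freq_hz} Hz")
print(f"equivalent sample rate: {equivalent_sample_rate_hz} Hz")
print(f"CD rate / this rate:    {ratio_to_cd}")
```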

3 comments

There is some small possibility of improvement through software techniques such as data assimilation, which can use information from surrounding time-frames to improve the measurement. This assumes that the magnitude of the vibrations changes a lot more slowly than the vibrations themselves, which is usually true and is what most audio compression relies on. It may be able to clean up the sound a little. However, I would say that the results they have obtained so far are very impressive.
The data comes in faster than 60 fps. A camera sensor doesn't capture the entire frame instantly every 1/60 second. It progressively scans through the frame over some measurable fraction of that 1/60 second. This is that quirk.

Suppose the camera scans 720 lines in HD every 1/60 second. Each row is offset in time by 1/43200 second. A rigid object could be slightly offset in space on each line of pixels, indicating that sound waves perturbed it in the time gap between when the camera captured each line. So that subframe video data can be turned back into audio at a much higher frequency than that apparent 60 Hz video sampling rate.

In other words, we're not just talking about 60 frames-per-second from a camera. It's really perhaps 43,200 rows per second, an enormously higher sampling frequency.
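As a quick sketch of that arithmetic (assuming, as in the example above, that the 720 rows are spread evenly across the whole 1/60 second; `row_timing` is just an illustrative helper):

```python
from fractions import Fraction

def row_timing(rows_per_frame, frames_per_second):
    """Return (rows per second, time offset between adjacent rows in seconds).

    Assumes the rows are read out evenly over the full frame period.
    """
    rows_per_second = rows_per_frame * frames_per_second
    return rows_per_second, Fraction(1, rows_per_second)

rows_per_second, row_interval = row_timing(720, 60)
print(rows_per_second)   # row samples per second
print(row_interval)      # seconds between adjacent rows, as an exact fraction
```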

> The data comes in faster than 60 fps

Yes, yes, that was completely obvious from the article. We are getting thousands of "measurements" per second.

However, each of those measurements is incredibly inaccurate. Each one is trying to detect the change of colour of 1/200 of the colour range in a single pixel. You may be getting less than a single bit of entropy per measurement.

An advanced signal processing technique will look at the longer-term picture. Sound vibrations are not a random walk - they tend to be a combination of sine waves, where the magnitude of each frequency component changes significantly more slowly than the vibrations themselves. Therefore they are to a certain extent predictable, and this predictability is used by audio compression algorithms. The signal processing algorithm will have to make use of the extremely limited information coming from the measurements, and match up possible sets of varying sine waves that could be causing those measurements. This may be sufficient to reject some of the noise that we could hear on that video, and clean up the sound a bit, but it is quite a hard (and CPU-intensive) processing task.
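A toy illustration of the idea, not the method from the article: a single steady tone buried in noise ten times larger than itself, recovered by correlating a full second of samples against sine and cosine at a candidate frequency. Every number here (440Hz tone, amplitude 0.1, Gaussian noise with sigma 1.0, 43,200 samples per second) is made up for the example, and a real signal would need many frequencies whose magnitudes drift over time.

```python
import math
import random

SAMPLE_RATE = 43_200     # samples per second (hypothetical, matches the row rate above)
TONE_HZ = 440            # hypothetical tone hidden in the measurements
TRUE_AMPLITUDE = 0.1     # much smaller than the noise
NOISE_SIGMA = 1.0        # per-measurement noise, 10x the signal amplitude

rng = random.Random(0)   # fixed seed so the run is repeatable

# One second of very noisy measurements of a steady sine wave.
samples = [
    TRUE_AMPLITUDE * math.sin(2 * math.pi * TONE_HZ * n / SAMPLE_RATE)
    + rng.gauss(0.0, NOISE_SIGMA)
    for n in range(SAMPLE_RATE)
]

def amplitude_at(samples, freq_hz, sample_rate):
    """Estimate the amplitude of a sine component at freq_hz.

    Correlates the samples with sin and cos at that frequency (a single
    DFT bin). Noise averages out over many samples; a steady tone does not.
    """
    sin_sum = 0.0
    cos_sum = 0.0
    for n, x in enumerate(samples):
        phase = 2 * math.pi * freq_hz * n / sample_rate
        sin_sum += x * math.sin(phase)
        cos_sum += x * math.cos(phase)
    return 2.0 * math.hypot(sin_sum, cos_sum) / len(samples)

estimate_on_tone = amplitude_at(samples, TONE_HZ, SAMPLE_RATE)
estimate_off_tone = amplitude_at(samples, 1000, SAMPLE_RATE)  # no tone here

print(f"estimated amplitude at {TONE_HZ} Hz: {estimate_on_tone:.3f}")
print(f"estimated amplitude at 1000 Hz: {estimate_off_tone:.3f}")
```

The estimate at the tone's frequency should land near the true amplitude, while the estimate at an unrelated frequency stays close to zero. That only works because the tone holds steady for the whole second, which is exactly the "magnitude changes slowly" assumption.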

Well, the reader would read as fast as it can.

Let's say that it reads the entire image in 1/120 second, then waits and does nothing for another 1/120 second before it starts reading the next frame.

The real number would be significantly smaller. Therefore they cannot bump the sample rate by more than five or six times. And I imagine they are already using some intelligent algorithm to evenly space out the captured samples.
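To make the 1/120 second scenario concrete, here is a sketch of when each row would be sampled. The 720 rows and the half-period readout are hypothetical figures carried over from this thread, not measurements of any real sensor, and `row_sample_times` is an illustrative helper.

```python
from fractions import Fraction

def row_sample_times(rows, fps, readout_fraction, n_frames=2):
    """Timestamps (in seconds) at which each row is read.

    readout_fraction is the share of the frame period spent reading rows;
    the sensor is assumed idle for the rest of the period.
    """
    frame_period = Fraction(1, fps)
    row_step = frame_period * readout_fraction / rows
    return [
        frame * frame_period + row * row_step
        for frame in range(n_frames)
        for row in range(rows)
    ]

# Hypothetical: 720 rows read in the first 1/120 s of each 1/60 s frame.
times = row_sample_times(rows=720, fps=60, readout_fraction=Fraction(1, 2))
gaps = [later - earlier for earlier, later in zip(times, times[1:])]

gap_within_frame = min(gaps)     # spacing between rows during readout
gap_between_frames = max(gaps)   # dead time from the last row to the next frame

print(f"row spacing during readout: {gap_within_frame} s")
print(f"gap across the idle period: {gap_between_frames} s")
```

The samples come in dense bursts separated by blind spots, so the row rate during readout overstates how much usable, evenly spaced audio data there is.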

A CD has a lot of overkill for basic speech. Doing a test here with some speech samples, 2kHz is ugly but intelligible, 1kHz is a mess but mostly understandable with effort, and 600Hz is almost useless for trying to find words (without any practice or computer assistance, of course).
"Intelligible speech": are you sure that speech recognition really requires the whole frequency range? Often it seems that data can be extracted after all, even if most of it is missing.