by baddox 5311 days ago
Cool idea, but I was sad that it wasn't finally an implementation of an idea I've had for a long time. My idea is to actually use video and audio information from distinct sources to create a single video/audio stream that is of better quality and/or completeness than any of the constituent parts. Essentially, my idea would do to video what Photosynth does for photos.

http://photosynth.net/


There's something along those lines in the works, I don't see a name: http://grail.cs.washington.edu/projects/videoenhancement/

It combines a few high-resolution still images with a low-res video to improve the video quality:

Static scene: http://vimeo.com/1513129 Dynamic: http://vimeo.com/2937785

(Looks a bit old (3 years) so I don't know if it's still in development)

That is just so much harder; though I haven't looked at the website yet, seems to be overwhelmed right now. Besides, I would want one consistent audio signal instead of one that varies in noise, volume, whatever. Video from different sources is alright though, since we are used to switching scenes and cameras all the time.
The idea I'm talking about should provide a single consistent audio signal. I know nothing about audio processing, but it seems like it should be possible to take multiple bad audio signals and combine them into one signal that's better than any constituent audio source. Perhaps one audio source captured low frequencies well, while another captured higher frequencies better.
This seems unlikely to be possible. If you haven't captured any frequencies above 15kHz (which an average cell phone mic is unlikely to do), no amount of averaging, filtering, or combining will get them back. There will also be a considerable amount of distortion, since concerts tend to be so loud that even one's ears are distorting. Good luck separating physical distortion in the mic, limiter distortion in the analog or DSP stage, and clipping distortion at the ADC.

I think the best you could do is use the video to determine where someone was standing, and try to reconstruct some of the stereo information based on multiple recorders.

> If you haven't captured any frequencies above 15kHz (which an average cell phone mic is unlikely to do), no amount of averaging, filtering, or combining will get them back.

I think this is technically not quite true. If two cell phones right next to each other are both sampling at 15kHz, in the best case you could combine their samples to get an equivalent sampling of 30 kHz. (Best case meaning phone 1 samples exactly half way between phone 2's samples.)

In practice, however, you would have to account for positioning and the fact that the phones' samples aren't perfectly offset from one another. It would require an amazing engineering feat to overcome this challenge, but I think it's within the realm of physically possible.
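
A minimal sketch of the best case described above, assuming two ideal samplers with no anti-alias filter, identical clocks, and an offset of exactly half a sample period. The function names and the 10 kHz test tone are illustrative assumptions, not anything from a real phone:

```python
import math

def sample(freq_hz, rate_hz, n, offset_s=0.0):
    """Sample a unit sine of freq_hz at rate_hz, starting offset_s seconds late."""
    return [math.sin(2 * math.pi * freq_hz * (k / rate_hz + offset_s))
            for k in range(n)]

def interleave(a, b):
    """Merge two equal-length streams as a[0], b[0], a[1], b[1], ..."""
    out = []
    for x, y in zip(a, b):
        out.extend((x, y))
    return out

rate = 15_000   # each phone's sample rate in Hz, as in the comment
tone = 10_000   # assumed tone: above one phone's 7.5 kHz Nyquist limit

phone1 = sample(tone, rate, 8)
phone2 = sample(tone, rate, 8, offset_s=0.5 / rate)  # exactly half a period later
combined = interleave(phone1, phone2)                # same points as a 30 kHz sampler
reference = sample(tone, 2 * rate, 16)               # what a real 30 kHz sampler sees
```

Here `combined` matches `reference` sample for sample, but only because the offset is perfect and nothing filtered the tone out before sampling.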

This is unfortunately so unlikely as to be practically impossible (currently!).

If the microphones, ADCs, etc. on both phones are incapable of capturing frequencies above e.g. 15kHz or below a certain range, combining those signals definitely won't bring you any closer to the original signal. You may be able to cancel out a fair bit of noise given enough processing, but you won't get back what hasn't been originally captured by either device.

That's before you get into phase problems from trying to combine two signals. A likely outcome is that the amplitude of some signals is increased whilst that of others is decreased due to phasing issues.

/fuzzily remembered music tech degree. May be too fuzzy though!

Sorry to be off topic here... but this is why I don't understand HN sometimes - the post above isn't nasty, argumentative, hurtful or 'bad' in any way, but instead of people responding to the poster they've downvoted him.

Isn't downvoting for removing bad content, not trying to silence someone you don't agree with?

Thoughts/comments?

(edit - post is no longer showing as greyed out/downvoted - but still, any comments?)

15kHz is actually a very high frequency; most adults can barely hear it. The audio from phones is probably much more band limited than that. But that's the least of the challenges. People listen to and enjoy highly band limited music all the time: laptop speakers might have a frequency response of 500Hz - 4kHz.

The more challenging problem is the distortion from the phones being overloaded, crowd noise, built in limiters, different sample rates and compression. It is true that phase relationships from a single sound captured by multiple sources can be very problematic.

However, the further the mics are away from the source, the less this is a problem, at least with "phaseyness." This annoying artifact is a type of comb filtering, and it's based on the fact that two mics close to a sound source can be thought of as really capturing the "same" sound at slightly different times. If the mics are far apart, the sound is no longer the same: each mic is picking up reflections from a myriad of sources, and the phase relationships within the frequency spectrum have been smeared and shifted by traveling through air. This negates a lot of phase problems. The more likely problem is cancellation in the low frequencies, which can be ameliorated with time alignment.
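
To make the comb filtering concrete, here is a rough sketch of the idealized case only: one signal summed with an identical copy delayed by `tau` seconds, which has gain |1 + exp(-j*2*pi*f*tau)| and cancels at f = (2k+1)/(2*tau). The 1 ms delay and the function names are assumptions for illustration; real mics far apart would not see identical copies, as the comment says.

```python
import cmath
import math

def comb_gain(freq_hz, delay_s):
    """Gain at freq_hz when a signal is summed with a copy delayed by delay_s."""
    return abs(1 + cmath.exp(-2j * math.pi * freq_hz * delay_s))

def notch_frequencies(delay_s, max_hz):
    """Frequencies (2k+1)/(2*tau) up to max_hz where the two copies cancel."""
    notches = []
    k = 0
    while (2 * k + 1) / (2 * delay_s) <= max_hz:
        notches.append((2 * k + 1) / (2 * delay_s))
        k += 1
    return notches

# Assumed 1 ms arrival difference: about 34 cm of extra path at 343 m/s.
delay = 0.001
notches = notch_frequencies(delay, 4000)  # roughly 500, 1500, 2500, 3500 Hz
```

Time-aligning the two recordings shrinks `delay_s`, which pushes the first notch up and out of the low frequencies.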

Only if the two phones are under-sampling the signal.

Chances are that there is a low pass filter in front of the phone's ADC, blocking signals above the Nyquist limit from reaching the sampler. Assuming brick wall filters (ie perfect cutoff), combining the signals will reduce variance (noise) but not give any information on frequencies above the cutoff frequency of the filter.

Brick wall filters don't exist though. What you might see is a minuscule amount of signal in the filter's stop band. Combining the signal from many many phones might reduce the variance enough to give useful information for frequencies a tiny bit above the cutoff frequency.
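
The variance reduction can be sketched like this, under the strong assumptions that every phone's noise is independent Gaussian noise and that the recordings are already perfectly aligned. The signal, noise level, and function names are made up for illustration:

```python
import math
import random
import statistics

def noisy_copies(clean, n_copies, noise_std, seed=0):
    """Simulate n_copies recordings of clean, each with independent Gaussian noise."""
    rng = random.Random(seed)
    return [[s + rng.gauss(0.0, noise_std) for s in clean]
            for _ in range(n_copies)]

def average(copies):
    """Sample-by-sample mean across all recordings."""
    return [sum(col) / len(col) for col in zip(*copies)]

def noise_variance(signal, clean):
    """Variance of whatever is left after subtracting the clean signal."""
    return statistics.pvariance([x - c for x, c in zip(signal, clean)])

clean = [math.sin(2 * math.pi * k / 50) for k in range(5000)]
copies = noisy_copies(clean, n_copies=16, noise_std=1.0)

single_var = noise_variance(copies[0], clean)          # close to 1.0
averaged_var = noise_variance(average(copies), clean)  # close to 1/16
```

With N independent copies the noise variance drops by about a factor of N, so pulling a tiny stop-band signal out of the noise would take a very large number of phones.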

A cool project would be to gather the audio from every networked microphone in an area (mobile phones, laptops, ...) and use beam-forming techniques to reconstruct the sound pressure field as a function of position. My guess is that the system would be sensitive enough that it could do amazing things like capture conversations through walls or from long distances.
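
For readers unfamiliar with beam-forming, a toy delay-and-sum sketch follows. It assumes things the comment does not address: known microphone positions, perfectly synchronized clocks, no reflections, and delays rounded to whole samples. All names and numbers are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 C

def delays_in_samples(mic_positions, focus_point, rate_hz):
    """Extra whole-sample delay at each mic, relative to the mic nearest the focus."""
    dists = [math.dist(m, focus_point) for m in mic_positions]
    nearest = min(dists)
    return [round((d - nearest) / SPEED_OF_SOUND * rate_hz) for d in dists]

def delay_and_sum(recordings, delays):
    """Undo each mic's delay, then average the recordings sample by sample."""
    length = min(len(r) - d for r, d in zip(recordings, delays))
    return [sum(r[k + d] for r, d in zip(recordings, delays)) / len(recordings)
            for k in range(length)]

mics = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]  # assumed positions in metres
source = (0.0, 0.0)
rate = 1000                                   # low rate keeps the toy example small

# Simulate a click at sample 5 that reaches each mic after its own delay.
true_delays = delays_in_samples(mics, source, rate)
recordings = [[1.0 if k == 5 + d else 0.0 for k in range(64)]
              for d in true_delays]

focused = delay_and_sum(recordings, true_delays)
off_target = delay_and_sum(recordings,
                           delays_in_samples(mics, (10.0, 10.0), rate))
```

Steering at the true source lines the clicks up into one full-height peak, while steering at the wrong point leaves them spread out at a third of the height each.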

You are suggesting the Networked Cellphone Echolocation Device of Batman at the end of the Dark Knight... Cool!
In some cases you will have a good prior on the clean signal from studio recordings. Of course, "registering" a recording (or parts of a recording) to the video would be a formidable task in itself.
>take multiple bad audio signals and combine them into one signal that's better

The problem eventually comes down to the fact that "better" is subjective. We're in the murky realm of art here. Should your algorithm keep that fret noise or the squeaking of a vocalist's intake of breath? Are they "noise," or are they part of the performance?

>I know nothing about audio processing

Not wishing to be rude, but this much is very evident. Recording engineers position their microphones with millimetre precision in order to combat phase issues, and that is in an ideal studio scenario. Doing what you suggest is basically impossible.

Maybe I'm overstating it, you could probably do something and it'd be a nice bit of research, but you wouldn't get useful results in the way that you're imagining.

> Not wishing to be rude, but this much is very evident. Recording engineers position their microphones with millimetre precision in order to combat phase issues, and that is in an ideal studio scenario. Doing what you suggest is basically impossible.

Actually, that much I know, because I've done some amateur home recording. I know that, for example, when you mic a snare drum with two microphones that are pointed at each other, you have to put a phase inverter on one microphone. I also know my way around the basic processors for audio production (compressor, limiter, EQ, etc.).

What I don't know much about is the undoubtedly more advanced techniques which may or may not exist that could realize the idea I'm talking about. The best idea I can come up with is, if you had one audio source that captured the dynamics of a concert (perhaps from a phone that was far away from the house speakers), and another audio source that captured a clearer yet "smashed" sound (perhaps from a phone closer to the house speakers), perhaps you could apply a compressor to the second source that was keyed on the dynamics of the first. Again, I might be full of crap here.
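
One rough way to try the idea above is sketched below. It is not a true keyed compressor but a simpler stand-in, block-wise envelope transfer: measure the loudness of the far recording in short blocks and rescale the near recording to match. It assumes the two recordings are already time-aligned; the names and block size are hypothetical.

```python
import math

def block_rms(signal, block):
    """RMS level of each consecutive, non-overlapping block of samples."""
    return [math.sqrt(sum(x * x for x in signal[i:i + block]) / block)
            for i in range(0, len(signal) - block + 1, block)]

def transfer_dynamics(near, far, block=100, floor=1e-9):
    """Rescale each block of near so its RMS matches the same block of far."""
    out = []
    levels = zip(block_rms(near, block), block_rms(far, block))
    for b, (near_rms, far_rms) in enumerate(levels):
        gain = far_rms / max(near_rms, floor)  # floor avoids dividing by zero
        out.extend(x * gain for x in near[b * block:(b + 1) * block])
    return out

# "Smashed" near source: constant level throughout.
near = [0.5 * math.sin(2 * math.pi * k / 20) for k in range(400)]
# Far source: quiet first half, loud second half.
far = [(0.1 if k < 200 else 0.8) * math.sin(2 * math.pi * k / 20 + 1.0)
       for k in range(400)]

shaped = transfer_dynamics(near, far)
```

The hard gain steps between blocks would click on real audio, so a usable version would smooth the gain curve, much as a compressor's attack and release do.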

I presume he means better in the sense that you'd try to remove overt noise, e.g. small conversations in the background, maybe wind noise. That sounds like it'd be possible to do. Improving a single video from multiple video sources would surely be impossible. Even with scene reconstruction etc you're not going to be improving the quality of any single video source...(?)
Presumably one would get rid of per-device degradation and compression artifacts.
This is the main thing (also tricky)
You can certainly automate crossfaded audio between multiple sources to try to get the cleanest copy, but it's hard. For instance, how do you decide whether it's noise or the letter "s" or the "chk" of a pick across muted guitar strings? The heuristics for "better than any constituent audio source" can be extremely nuanced, algorithmically intensive, and still difficult to pin down, akin to speech recognition. Speaking purely to SaaS'y automated purposes, natch.

Typically what it seems you're talking about for audio here is similar to a matrix mix in the amateur/live audio world. People have been (manually) mixing soundboard audio with audience-recorded audio to improve the audio quality of recorded shows for some years now.

I don't know anything about audio processing algorithms, but (assuming there's a way), presumably with enough audio sources, there'd be a commonality between each one that describes the 'correct' sound. I.e. if there's different noise going on in each source (people talking around each microphone, at a gig, intermittently), you don't really need to decide which sound is 'clean' because you'd know which sounds are inconsistent...(?)
Sure, but it's determining what is "correct" that is the hard part. You could use a majority rule if you have three or more sources, but requiring more and more additional sources starts getting into pretty niche territory, and it still remains possible that the minority source is the most faithful one.
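
A per-sample median is one simple form of the majority rule discussed here. The sketch below assumes the sources are already aligned and level-matched, which is the genuinely hard part; the sample values are invented.

```python
import statistics

def median_combine(sources):
    """Per-sample median across aligned sources; one outlier per sample is ignored."""
    return [statistics.median(samples) for samples in zip(*sources)]

clean = [0.0, 0.5, 1.0, 0.5, 0.0]
source_a = list(clean)
source_b = [0.0, 0.5, 9.0, 0.5, 0.0]    # someone shouts next to this mic
source_c = [0.0, 0.5, 1.0, -7.0, 0.0]   # this mic gets bumped

combined = median_combine([source_a, source_b, source_c])
```

Here `combined` recovers `clean` because each disturbance hits only one source at a time. If two of three sources were wrong at the same sample, the median would follow them, which is the minority-is-right caveat.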
If both channels have a similar spike at the same frequency at the same time, it is probably part of the signal (not noise), so combine those, and dampen all others. This would cover your case, if the other channel had enough of the low/hi freq of the other to relate them. I reckon Shannon looked at exactly this in developing Information Theory (for telephone signals on flaky lines), and it's probably all textbook stuff now.
>at the same time

The thing is, sound doesn't travel all that fast when you consider the wavelengths of vocal-range soundwaves. Those spikes are not going to arrive at the same time on the different phones.

As ever with DSP, phase problems will be the ruin of you.
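
Some back-of-the-envelope numbers for the arrival-time point, assuming sound travels at about 343 m/s in room-temperature air (function names are illustrative):

```python
SPEED_OF_SOUND = 343.0  # m/s, approximate value for air at about 20 C

def arrival_delay_ms(extra_distance_m):
    """Extra travel time, in milliseconds, for a mic that is farther from the source."""
    return extra_distance_m / SPEED_OF_SOUND * 1000.0

def wavelength_m(freq_hz):
    """Wavelength in metres of a tone at freq_hz."""
    return SPEED_OF_SOUND / freq_hz

def periods_of_delay(extra_distance_m, freq_hz):
    """How many full cycles of freq_hz fit into the extra path length."""
    return extra_distance_m / wavelength_m(freq_hz)

delay_10m = arrival_delay_ms(10.0)         # about 29 ms for a phone 10 m farther back
cycles_1khz = periods_of_delay(10.0, 1000) # about 29 cycles of a 1 kHz tone
```

So two phones 10 m apart hear the same spike tens of milliseconds apart, many cycles out of step at vocal frequencies.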

nice point, but they'd synchronize with an offset. I doubt absolute time would be used to synchronize the videos anyway; they'd be matched by content.

Or do you mean that different frequencies will travel at different speeds, enough to make (e.g.) high and low frequencies arrive at different times? Whoa, apparently it does (http://en.wikipedia.org/wiki/Speed_of_sound#Effect_of_freque...) but seems to be a small effect.
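
Matching by content is commonly done with cross-correlation. A brute-force sketch, assuming a single fixed offset and equal sample rates (a drifting offset would need this repeated over short windows); the names and the 7-sample lag are hypothetical:

```python
import random

def best_lag(a, b, max_lag):
    """Lag in samples, within +/- max_lag, where b[k + lag] best matches a[k]."""
    def score(lag):
        # Unnormalized cross-correlation over the overlapping region.
        return sum(a[k] * b[k + lag]
                   for k in range(len(a))
                   if 0 <= k + lag < len(b))
    return max(range(-max_lag, max_lag + 1), key=score)

rng = random.Random(1)
phone_a = [rng.gauss(0.0, 1.0) for _ in range(200)]  # stand-in for real audio
phone_b = [0.0] * 7 + phone_a                        # same content, 7 samples late

lag = best_lag(phone_a, phone_b, max_lag=20)
```

This costs O(n * max_lag) multiplications, so anything longer than a short clip would normally use an FFT-based correlation instead.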

When setting up a big sound system, you have to delay the bassbins by up to 10ms, depending on the size of the cabinets.

Also, how do you calculate your offset? Consider that it is constantly changing.

We've actually thought about this, but it's a low yield proposition until we've shown there's a market for what we have now. It's not impossible, but it is very hard to do :)
This is CD Audio with two different camera sources, edited and synced. Is that what you're talking about?

http://www.youtube.com/watch?v=xW4UVjROQdw

I've been thinking the same thing about audio for donkey's years...