by hathawsh 2777 days ago
Hi! As someone who seems to struggle more than most to understand people on video calls, I'd like to give you my impressions.

Something struck me about the sample video. The very first sample included background noise, but it was very easy to understand regardless of the noise, probably because it was recorded by a pro microphone rather than a phone. Every other sample was far more difficult, regardless of noise removal. Noise removal doesn't really seem to help; in fact, any imperfections in the noise removal process actually make the audio more difficult to understand because I have to guess not only the speaker's voice and the noise but also the algorithm for noise removal.

What does help me is low frequency pickup. I think the first sample is easy because there are plenty of low frequency components that are later lost through the phone.

Low frequencies are presumably difficult to pick up due to the size of the microphone in a phone, but could there be a way to restore those frequencies through audio processing? It would be interesting to analyze the response of specific microphones to specific low frequencies and find patterns that an audio processor could use to restore the low frequency components.
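As a rough sketch of what that analysis could feed into: given a measured frequency response for one microphone, an equalizer could boost each band by the amount the microphone lost, up to some cap. Everything here is an assumption for illustration: the `phone_mic_db` numbers are made up, and `compensation_gains_db` and `max_boost_db` are hypothetical names, not part of any real tool. Note that a boost can only raise what the microphone actually captured; anything below the noise floor just comes back as louder noise, which is why the boost is capped.

```python
def compensation_gains_db(measured_response_db, max_boost_db=12.0):
    """Per-frequency boost (dB) that would flatten the measured response.

    measured_response_db maps frequency (Hz) to level relative to 1 kHz (dB).
    Only attenuation is compensated, and the boost is capped at max_boost_db.
    """
    return {
        freq: min(max(0.0, -level), max_boost_db)
        for freq, level in measured_response_db.items()
    }


def db_to_linear(gain_db):
    """Convert a gain in dB to a linear amplitude multiplier."""
    return 10.0 ** (gain_db / 20.0)


# Hypothetical measurements, not taken from any real phone microphone.
phone_mic_db = {80: -24.0, 125: -15.0, 250: -6.0, 500: -1.5, 1000: 0.0}

gains = compensation_gains_db(phone_mic_db)
multipliers = {freq: db_to_linear(g) for freq, g in gains.items()}
```

With these made-up numbers, the 80 Hz and 125 Hz bands hit the 12 dB cap, so they would still come out quieter than the original voice.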

Anyway, kudos for doing some very interesting work. I don't know how representative my experience is.

2 comments

In my experience it's the loss (or masking) of high frequencies that is the most problematic for understanding speech. The most important sounds in speech are consonants, which are higher frequency sounds. Combine this with foreign accents, and more often than not conference calls quickly degenerate into an unintelligible babble (for me, at least).

> I don't know how representative my experience is.

As someone who works with speech content, this seems unusual. Typically, low frequencies are reduced because there's not much useful voice signal there—for example, NPR typically rolls off frequencies below 250 Hz.
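For anyone who wants to hear what a roll-off like that does, here is a minimal sketch of a first-order high-pass filter with a 250 Hz cutoff. It is only an illustration: a first-order filter is the gentlest possible slope (6 dB per octave), a broadcast chain may well use something steeper, and `high_pass`, `tone`, and `rms` are names made up for this example.

```python
import math


def high_pass(samples, sample_rate, cutoff_hz=250.0):
    """First-order RC-style high-pass filter (6 dB/octave below cutoff)."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    out = []
    prev_in = 0.0
    prev_out = 0.0
    for x in samples:
        # Pass changes in the input, let slow (low-frequency) content decay.
        y = alpha * (prev_out + x - prev_in)
        out.append(y)
        prev_in, prev_out = x, y
    return out


def tone(freq_hz, sample_rate=16000, seconds=0.5):
    """Generate a unit-amplitude sine wave as a list of floats."""
    n = int(sample_rate * seconds)
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def rms(samples):
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


low = tone(60)     # hum-range tone, well below the cutoff
high = tone(2000)  # tone in the range where consonant energy lives

low_ratio = rms(high_pass(low, 16000)) / rms(low)
high_ratio = rms(high_pass(high, 16000)) / rms(high)
```

Comparing `low_ratio` and `high_ratio` shows the 60 Hz tone being cut heavily while the 2 kHz tone passes nearly unchanged.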

Thanks for your viewpoint!

Here's something concrete: the first phrase in the video ends with "small demonstration", but starting with the second instance, I distinctly hear "sall" instead of "small". In the version with the noise, the "m" sounds like an aberration of the noise and is detectable. With the noise removed, the "m" is replaced with a blip that sounds like an encoding error.