by crazygringo 2243 days ago
This is a really interesting technical concept.

Capturing high-quality audio in a meeting room for videoconferencing is a notoriously complicated problem.

Microphones are crazy sensitive and pick up things like footsteps and conversations outside the door, shuffling feet and tapping on keyboards, and construction and HVAC noise like you wouldn't believe.

So filtering those things out, and then capturing the best quality audio from the current speaker, and trying to get everyone's voice at roughly the same volume whether they're sitting directly across from the microphone or are piping up from the corner of the room...

...and do this all while cancelling 100% of the echo that might be coming from two or three speakers at once...

...it's an insanely hard problem. Beamforming microphones absolutely help in a huge way, because if you know the speaker's voice is coming from 45° then knowing that any sound coming from any other angle can be removed is a really helpful piece of info.
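To make the "sound from other angles can be removed" idea concrete, here is a minimal delay-and-sum sketch. It is only an illustration, not how any particular product works. The four-mic line array, its 10 cm spacing, the 48 kHz sample rate, and the use of white noise as a stand-in for speech are all assumptions made for the example.

```python
import math
import random

SPEED_OF_SOUND = 343.0        # m/s (assumed room-temperature value)
SAMPLE_RATE = 48_000          # Hz (assumed)
MIC_X = [0.0, 0.1, 0.2, 0.3]  # hypothetical mic positions along a line, in metres

def delays_for_angle(angle_deg):
    """Arrival delay (whole samples) at each mic for a far-field source at angle_deg."""
    cos_a = math.cos(math.radians(angle_deg))
    raw = [-x * cos_a / SPEED_OF_SOUND * SAMPLE_RATE for x in MIC_X]
    earliest = min(raw)
    return [round(d - earliest) for d in raw]

def simulate_mics(source, angle_deg):
    """Each mic hears the same source, shifted by its own arrival delay."""
    delays = delays_for_angle(angle_deg)
    pad = max(delays)
    return [[0.0] * d + source + [0.0] * (pad - d) for d in delays]

def delay_and_sum(mic_signals, steer_deg):
    """Undo the delays expected for steer_deg, then average the channels."""
    delays = delays_for_angle(steer_deg)
    n = len(mic_signals[0]) - max(delays)
    return [
        sum(sig[i + d] for sig, d in zip(mic_signals, delays)) / len(mic_signals)
        for i in range(n)
    ]

def power(signal):
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in signal) / len(signal)

rng = random.Random(0)
voice = [rng.uniform(-1.0, 1.0) for _ in range(4000)]  # stand-in for speech
mics = simulate_mics(voice, angle_deg=45)

on_target = power(delay_and_sum(mics, steer_deg=45))
off_target = power(delay_and_sum(mics, steer_deg=135))
print(f"steered at 45 deg: {on_target:.3f}, steered at 135 deg: {off_target:.3f}")
```

Steering at the true angle lines the channels up so they add coherently. Steering elsewhere leaves them misaligned, so the same sound partly averages out. Note that `delays_for_angle` only works because the mic positions are known, which is exactly what you don't have with a pile of laptops on a table.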

Now, with beamforming microphones, the precise relative location and direction of each mic is known. The idea of creating one big beamforming mic for the room out of people's individual mics is... insanely hard, but super cool.

It's interesting to me that this article is about measuring the quality of voice transcription, rather than about the quality of audio in an actual meeting. But I suppose the voice transcription quality measurement is simply a proxy for the speaker audio quality generally, no?

This could actually be a huge step forward in not needing videoconferencing equipment in meeting rooms. So far, one of the biggest reasons has actually been dealing with echo and feedback -- when people are in the same call with multiple devices in the same room, it tends to end badly. But if the audio processing is designed for that... the results could actually be quite amazing.

And it's well-known that the "bowling alley" visual of meeting participants (camera at the end of a long conference table) isn't ideal. If each participant has their own laptop camera on themselves, it could be a vastly better experience for remote participants.

6 comments

> And it's well-known that the "bowling alley" visual of meeting participants (camera at the end of a long conference table) isn't ideal. If each participant has their own laptop camera on themselves, it could be a vastly better experience for remote participants.

My company pushes us to have any conference that will include remote people from our desks, even if some or most of the attendees are in the same physical location. It means that no audio is dropped because of too much cross-talk and that all attendees are on the same footing. The only real issue is that we don’t automatically get headsets; you need to request/expense one.

Yeah, this is just a much better way to conduct meetings.

I've been in many meeting rooms where there's a single projector/TV, and the person controlling it only shows either the remote cameras OR their screen (while they're sharing), so that isolates the remote people even more. (I've also been the remote person in this situation, and it definitely feels more like being an occasionally noisy fly on the wall than a full participant.)

Everyone also gets their full desktop (big/multiple monitors, full keyboard, etc).

It'll be interesting to see what happens post-lockdown.. will the people miss the benefits of "one remote = all remote" and have more empathy for remote people, or will we go back to the same old?

> It'll be interesting to see what happens post-lockdown.. will the people miss the benefits of "one remote = all remote" and have more empathy for remote people, or will we go back to the same old?

I think it'll be like everything that could be learned during this time, someone has to recognize the lesson and actively work to implement it.

My main issue with using my desk is that I normally keep my laptop closed and off to the side, so if I just open it, the view of me is in profile and it doesn't look like I'm paying any attention. IT is loath to buy an external webcam for anyone because "every laptop comes with a webcam"; luckily I was able to source a spare one they had. But I know that most of the desks are set up the same way as mine, so most people either choose to use their laptop screen as the main monitor or just don't enable video for the call.

> have any conference that will include remote people from our desks,

This is great for the participants, but absolute hell for everyone else in open offices, or even shared offices.

Really late to the party, but I love this concept. I feel like this would be really difficult in an open office/shared office.

I enjoy team based offices, 7-10 man rooms. Even in there, this would probably be a nightmare unless you had this tech running in real time so you don't get microphone crosstalk/echo.

Nonetheless, I really like the spirit of the system.

Apparently IBM tested a system where participants' faces were projected onto dummies' faces in a real room, with voice relayed through speakers on each dummy, then the whole thing recorded and broadcast to participants.

> Capturing high-quality audio in a meeting room for videoconferencing is a notoriously complicated problem

Not from my experience of 20 years ago setting up VC systems. The biggest issue was video: making sure the lighting was good and that there was a plain wall behind (sky blue was a good colour for that).

Audio-wise, it was many desk-standing mics (can't recall the main brand, but there were a few).

Did have one issue once when setting up a connection to a remote French company: there was no audio from their end. It turned out that the technician at the other end was sat on the end of the table in front of the camera and was also sat upon the mic that was on the table. Soon solved but still, most funny.

Back then we had VC systems that could roll into a room on a cart and worked well. PictureTel, IIRC, was one solution back then, and Polycom was another that soon overtook them, as well as making wonderful conferencing microphones.

But as bandwidth got cheaper and more accessible, many meeting rooms that would be too noisy visually for VC became accessible and the need for dedicated rooms drifted away for more client usage.

Though audio from my experience back then was the easy part.

Quality video is definitely hard too, but it's just not as important.

If we have beautiful, well-lit video feeds of every participant, but no one can hear what they're saying -- that's a deal breaker. The other way around, if we have clean, crisp audio from everyone and inconsistent video, at least the conversation can still move forward.

Even just watching a random clip on YouTube, it's fairly easy to forgive a low-quality video feed, but bad audio gets really annoying very quickly. Any lag or stuttering or artifacts etc. in the audio is a dealbreaker for most people.
What I meant was it's a complicated problem for the software and microphone engineers. Not for installation! :)
"Please, before we start the meeting, can everyone in the room allow app microphone access for the best experience?"
It's a hard and interesting signals problem with surely many other benefits but surely money would be better spent just buying better mics and audio gear for an office.
> This could actually be a huge step forward in not needing videoconferencing equipment in meeting rooms. So far, one of the biggest reasons has actually been dealing with echo and feedback -- when people are in the same call with multiple devices in the same room, it tends to end badly. But if the audio processing is designed for that... the results could actually be quite amazing.

> And it's well-known that the "bowling alley" visual of meeting participants (camera at the end of a long conference table) isn't ideal. If each participant has their own laptop camera on themselves, it could be a vastly better experience for remote participants.

It seems to me that these days the simpler solution to both these problems is to just have people use AirPods.

That doesn't work for multiple people in the same room all wearing AirPods. Everyone's mic picks up everyone's voice, not just the "real" speaker.

And a lot of meetings have most (e.g. 10) people in a single room, with another handful (e.g. 5) of remote participants.

AirPods don't capture other people's voices very well. A voice is basically inaudible unless you've got the AirPod plugged in your ear and pointed at your mouth.
How about a regular headset?
Most headsets come with omnidirectional microphones. Even some "noise cancelling" microphones are actually omnidirectional, just with an arm long enough to be reasonably close to the mouth. When I recently decided I needed a cardioid microphone on my headset, I ended up spending about $300 total.
Which headset was that? I’ve been looking for one with a cardioid mic but wasn’t able to track any down.
AT BPHS1. That has an XLR connector, so I also got a Scarlett Solo. I also recommend replacement cushions, e.g. from Brainwavz, as the stock ones are for people with some tiny ears (my ears hurt pretty badly after a long stream of meetings).

I've heard good things about Sennheiser PC 8 for tighter budgets, but haven't tried personally.

Perhaps this is why simpler isn't always better. You aren't really solving the problem and the cost of your solution outweighs its "simplicity".

15 people around the table in a conference room, each wearing AirPods (and thus connected to their own device), is an expensive solution with a lot of points of failure.

> it's an insanely hard problem

Not much: each participant has a pair of (amplitude, phase) values for each microphone. Filtering human voices and correlating sources to find the phase is not new.

That's not how amplitude and phase work, you don't "set them" on a microphone.

Filtering sources isn't new, of course, and doing it manually fiddling around with a recording you already made is one thing.

But doing it automatically, in real time, on equipment with unknown characteristics, in a way that leaves zero ghost signal behind... is, yes, insanely hard.

> That's not how amplitude and phase work, you don't "set them" on a microphone.

I never said you "set them". They obviously are a product of the location of the microphone in relation to the speaker.

Agree. It's even easy to do manually in Audacity with recorded tracks. There are probably some ML innovations here (maybe isolating the voice signal from the background in a way that lets you label the phase info sufficiently to correlate it), but the main innovation that I can see is packaging it in a way that's useful in this context.
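For anyone wondering what "correlating sources to find the phase" looks like in the easy, offline case, here is a rough sketch that estimates the relative delay between two recordings of the same talker by brute-force cross-correlation. The 17-sample delay, the 0.4 gain, and the white-noise "voice" are made-up values for illustration. It also assumes both devices share a perfectly synchronised sample clock with no noise or echo, which real laptops won't give you.

```python
import random

def best_lag(reference, other, max_lag):
    """Lag (in samples) at which `other` best lines up with `reference`."""
    def score(lag):
        # Cross-correlation at one lag: sum of products over the overlapping samples.
        return sum(
            reference[i] * other[i + lag]
            for i in range(len(reference))
            if 0 <= i + lag < len(other)
        )
    return max(range(-max_lag, max_lag + 1), key=score)

rng = random.Random(1)
voice = [rng.uniform(-1.0, 1.0) for _ in range(2000)]  # stand-in for speech

true_delay = 17  # hypothetical: laptop B hears the talker 17 samples after laptop A
gain_b = 0.4     # hypothetical: laptop B is farther away, so the voice is quieter
mic_a = voice
mic_b = [0.0] * true_delay + [gain_b * v for v in voice[:-true_delay]]

estimated = best_lag(mic_a, mic_b, max_lag=50)
print("estimated delay in samples:", estimated)
```

This is the clean version of the problem. Once clocks drift, several people talk at once, and the loudspeaker output leaks back into every mic, the correlation peak stops being this obvious, which is where the real-time difficulty comes in.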