by gpvos 704 days ago
It just seems more logical for the OS to do that, rather than the application. Basically every application that uses microphone input will want to do this, and will want to compensate for all audio output of the device, not just its own. Why does the OS not provide a way to do this?
9 comments

> Basically every application that uses microphone input will want to do this

The OS doesn't have more information about this than applications and it's not that obvious whether an application wants the OS to fuck around with the audio input it sees. Even in the applications where this might be the obvious default behavior, you're wrong - since most listeners don't use loudspeakers at all, and this is not a problem when they wear headphones. And detecting that (also, is the input a microphone at all?) is not straightforward.

Not all audio applications are phone calls.

>The OS doesn't have more information about this than applications

the OP pointed out that this only works if he uses a browser monoculture

the OS does have more information than that, it can know what is being played by any/all apps, and what is being picked up by the mic

The "OS" isn't special here, apps can listen to system audio.

fwiw, you only need to know anything about outputs if you are doing AEC. Blind source separation doesn't have that problem and can just process the input stream.

> The "OS" isn't special here, apps can listen to system audio.

Even if this is true, it's easy to imagine such functionality being exploited by malicious apps as a security and/or privacy concern, particularly if the user needs a screen reader.

It definitely makes sense for the operating system to provide this functionality.

The OS can have multiple sound input devices for the application to choose from, "raw" and "fuckarounded with"
That doesn't make sense in the context of default devices. macOS's AVKit (or is it CoreAudio?) APIs, which configure the streams created on the device, make way more sense, since it's a property of the audio i/o stream and not the devices.
Assuming this isn't parody, the OS doesn't have to do it automatically. Having an application grab a microphone stream and say to the OS "take this and cancel any audio out streams" might be pretty useful.
I agree with that, but the point I'm trying to make is that audio i/o handling is pretty complicated and application specific. The claim I'm challenging is "any app that wants microphone input wants this", which I find dubious. I'd say only a small number of the audio applications that care about mic input want background noise reduced, and it makes sense for this to be configured per input stream.

Really what would be nice is if every audio i/o backend supported multiplex i/o streams and you could configure whether or not to cancel audio based on that set of streams but not all output (because multi output-device audio gets tricky).

I'm honestly having trouble thinking of a case where I wouldn't want this.

I'm sure there are some niche cases, but in those cases, the application can specifically request that the OS turn off audio isolation.

The technique introduces latency and distortion because it's subtracting an estimate of sound that's traveling/reflecting in the listening environment, which is imperfect and involves the speed of sound.
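
To make "subtracting an estimate" concrete, here is a minimal sketch of one classic approach, a normalized LMS (NLMS) adaptive filter. It is an assumption that any particular OS or browser works this way; production cancellers are far more elaborate. The function names, the toy "room" (a single delayed, attenuated copy of the speaker signal) and the parameter values are all made up for illustration.

```python
import random


def nlms_echo_cancel(mic, ref, taps=8, mu=0.5, eps=1e-8):
    """Return mic minus an adaptive estimate of the echo of ref.

    mic: samples captured by the microphone
    ref: samples sent to the loudspeaker (the known output)
    taps, mu, eps: illustrative values, not tuned for real audio
    """
    weights = [0.0] * taps   # running estimate of the speaker-to-mic path
    history = [0.0] * taps   # most recent reference samples, newest first
    residual = []
    for d, x in zip(mic, ref):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        e = d - echo_estimate            # what the application would receive
        power = sum(h * h for h in history) + eps
        step = mu * e / power            # step size normalized by input power
        weights = [w + step * h for w, h in zip(weights, history)]
        residual.append(e)
    return residual


def simulate_echo(n=4000, delay=3, gain=0.6, seed=0):
    """Toy room: the mic hears only the speaker signal, delayed and attenuated."""
    rng = random.Random(seed)
    ref = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    mic = [gain * ref[i - delay] if i >= delay else 0.0 for i in range(n)]
    return mic, ref


def energy(samples):
    """Sum of squared samples."""
    return sum(s * s for s in samples)


if __name__ == "__main__":
    mic, ref = simulate_echo()
    out = nlms_echo_cancel(mic, ref)
    tail = len(mic) // 2
    print("echo energy (second half):    ", energy(mic[tail:]))
    print("residual energy (second half):", energy(out[tail:]))
```

In this sketch the echo leaks through until the weights converge, and the filter can only model an echo path shorter than `taps` samples. A real room has a much longer, changing response, which is one reason the estimate stays imperfect.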

That latency is within the tolerance that users are comfortable with for voice chat, and much less than video processing/transfer is introducing for video calls anyway, so it's a very obvious win there. Especially since those users are most interested in just picking out clear words using whatever random mic/speaker configuration happens to be most convenient.

But musicians, for instance, are much more interested in minimizing the delay between their voice or instrument being captured and returned through a monitor, and they generally choose a hardware arrangement that avoids the problem in the first place. And that's not really a niche use case.

Live video or audio chat is basically the only time you do want this. Granted, that’s a big chunk of microphone usage in practice, but any time you are doing higher fidelity audio recording and you have set up the inputs accordingly you absolutely do not want the artefacts introduced by this cancellation. DAWs, audio calibration, and even live audio when you’ve ensured the output cannot impact the recording all would want it switched off.

Default on vs default off is really just an implementation detail of the API though, as you say.

> Live video or audio chat is basically the only time you do want this.

If I'm recording a voice memo, or talking to an AI assistant, I would want this. Basically everything I can imagine doing with a PC microphone outside of (!) professional audio recording work.

That last case is important and we agree there needs to be a way to turn it off. I think defaults are really important though.

I gave an example: when I'm wearing headphones I don't want this enabled. If I'm recording anything, I probably don't want it on either. If I'm using a virtual output, I don't want AEC to treat that as a loudspeaker.
Every normal application already does it through the OS because most do not care about this at all.

Music player, browser, games, video player...

Audio is not app specific

The only application where this is true is audio where you want full control and low latency.

I find your take very weird.

> Why does the OS not provide a way to do this?

Some do.

But you need to have a strong-handed OS team that's willing to push everybody towards their most modern and highly integrated interfaces and sunset their older interfaces.

Not everybody wants that in their OS. Some want operating systems that can be pieced together from myriad components maintained by radically different teams, some want to see their APIs/interfaces preserved for decades of backwards compatibility, some want minimal features from their OS and maximum raw flexibility in user space, etc.

> Some do

Which operating systems do this?

macOS has done this in recent versions. Similarly it will do all the virtual background and bokeh stuff for webcams outside of the (typically horrific) implementations in video conferencing apps.
Others have already pointed out macOS/Linux, here's Windows:

https://learn.microsoft.com/en-us/windows-hardware/drivers/a...

As others have noted, this is trivial for most macOS and iOS apps to opt in to.

Frankly, I imagine it's also available at the system level on Windows (and maybe Android and Linux) but probably only among applications that happen to be using certain audio frameworks/engines.

It doesn't seem to me that module-echo-cancel in PulseAudio completely meets the requirements here (only one source), but it looks close, and seems in general like where you would implement something like this.

1. https://www.freedesktop.org/wiki/Software/PulseAudio/Documen...

I think module-null-sink and module-loopback could be used to create a virtual source which combines multiple sources, though the source/sink thing makes my head spin. Or, more simply, I suppose using the loopback of whatever audio output device does the combination (and the same mixing) for you, if you play all audio through one output device (which is most likely)?
> though the source/sink thing makes my head spin

Wait, what other audio paradigms are there?

so something for systemD then?
On macOS/iOS, you get this using the AVAudioEngine API if you call setVoiceProcessingEnabled(true) on the input node. It corrects for audio being played from all applications on the device.
My first thought in reading the question was “if your browser is doing that, your platform architecture has… some room for improvement”.
Having room for nontrivial improvement is, to be fair, a normal state of affairs for platforms.
This has certainly made conference calls significantly more usable. I feel like it must have come around during 2020, because pre-covid I would go around BEGGING everyone I did calls with to get a headset, because otherwise everyone else's voice would echo back through their microphone 0.75s later. I recently realized I could just literally do calls out loud on my laptop mic and speaker and somehow it works. Nice to know why!
This assumes there is an OS-managed software mixer sitting in the middle of all audio streams between programs and devices. Historically, that wasn't the case, because it would introduce a lot of latency and jitter in the audio. I believe it is still possible for a program to get exclusive access to an audio output device on Windows (WASAPI) and Linux (ALSA).
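
A toy sketch of why the mixer matters here: a canceller at the OS level can only use as its reference whatever passes through the mixer. The `(samples, goes_through_mixer)` representation is hypothetical, a stand-in for shared versus exclusive/bypass output, and does not model any real WASAPI or ALSA behaviour.

```python
def mix(blocks):
    """Sum equal-length sample blocks and clip the result to [-1.0, 1.0]."""
    if not blocks:
        return []
    return [max(-1.0, min(1.0, sum(column))) for column in zip(*blocks)]


def aec_reference(streams):
    """Reference signal an OS-level canceller could see.

    streams: list of (samples, goes_through_mixer) pairs. The flag is a
    hypothetical stand-in for shared vs. exclusive/bypass output.
    """
    return mix([samples for samples, through_mixer in streams if through_mixer])


if __name__ == "__main__":
    music = [0.5, 0.5]
    game = [0.25, -0.75]
    print(aec_reference([(music, True), (game, True)]))   # both streams mixed
    print(aec_reference([(music, True), (game, False)]))  # game bypasses the mixer
```

Anything that bypasses the mixer is simply missing from the reference, so its echo cannot be subtracted this way.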
Historically, true, but nowadays it's pretty much standard for all the big OSes.

Being able to get exclusive access/bypass the system via certain means (ASIO would be another) doesn't make it go away.

The OS doesn't know that the application doesn't want feedback from the speaker, and not 100% of applications will want such filtering. I think a best practice from the OS side would be to provide it as an optional flag. (Default could be on or off, with reasonable possibility for debate in either direction, but an app that really knows what it wants should be able to ask for it.)
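
A minimal sketch of what such an optional flag could look like, assuming a made-up API: `EchoCancel`, `MicStreamRequest` and `should_cancel` are hypothetical names, not taken from any real OS. The app can force cancellation on or off, and otherwise the OS policy applies (here: only when output goes to a loudspeaker, which covers the headphones case raised upthread).

```python
from dataclasses import dataclass
from enum import Enum


class EchoCancel(Enum):
    DEFAULT = "default"  # app has no opinion; the OS policy decides
    ON = "on"            # app explicitly asks for cancellation
    OFF = "off"          # app explicitly asks for the raw input


@dataclass
class MicStreamRequest:
    """Hypothetical parameters an app passes when opening a mic stream."""
    sample_rate: int = 48000
    echo_cancel: EchoCancel = EchoCancel.DEFAULT


def should_cancel(request, output_is_loudspeaker, os_default_on=True):
    """Decide whether the OS applies echo cancellation to this stream."""
    if request.echo_cancel is EchoCancel.ON:
        return True
    if request.echo_cancel is EchoCancel.OFF:
        return False
    # No explicit request: fall back to the OS default, and skip the
    # processing when the output is not a loudspeaker (e.g. headphones).
    return os_default_on and output_is_loudspeaker


if __name__ == "__main__":
    voice_chat = MicStreamRequest()
    daw = MicStreamRequest(echo_cancel=EchoCancel.OFF)
    print(should_cancel(voice_chat, output_is_loudspeaker=True))
    print(should_cancel(daw, output_is_loudspeaker=True))
```

Whether `os_default_on` should be true or false is exactly the default-on versus default-off debate; the sketch only shows that either choice leaves the app able to override it.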
There is a third place: a common library that all the apps use. If it is in the OS then it becomes brittle. If there's an improvement in the technology which requires an API change, that becomes difficult without keeping backwards compatibility or the previous implementation forever. Instead, there would be a newer generation common library which might eventually replace the first but only if the entire ecosystem chooses to leave the old one behind. Meanwhile there'd be a place for both. Apps that share use of a library would simply dynamically link to it.

This is the way things usually work in the Free Software world. For example: need JPEG support? You'll probably end up linking to libjpeg or an equivalent. Most languages have a binding to the same library.

Is that part of the OS? I guess the answer depends on how you define OS. On a Free Software platform it's difficult to say when a given library is part of the OS and when it is not.

> If it is in the OS then it becomes brittle

My experience is the opposite. When it's part of the OS, it's stable and you just say "you need OS version X or better" and it will just work. When it's a library, you eventually end up in dependency hell of deprecated libraries and differing versions (or worst case, the JavaScript ecosystem, where the platform provides almost nothing and you get npm).

Depends on the OS I guess. When it's established enough, all distributions carry a high enough version that it's not an issue. If it's not established enough, I'd argue that it isn't ready to be part of an "OS" anyway (regardless of the definition of that word).
I suppose the OS probably makes something like this available. When using VoiceOver on Mac and presenting in Teams, by default only the mic comes into Teams; you need to do something to share the other processes' audio.

That's Mac, of course, but in my experience Windows is much more trusting of what it gives applications access to, so I suppose the same thing is available there.

How sure are you that basically every application wants this? So should there be a flag at the OS level for enabling the cancellation? How do you control that flag?
It would be trivial to pass that flag in whatever API the application calls to request access to the microphone stream.
Did you just invent yet another Linux audio stack?