by Arathorn 1862 days ago
The first cut will be using Jitsi as the engine, albeit with more appropriate UI, which gives you AEC (echo cancellation) from WebRTC and some level of AGC (normalisation). But we have plans to go far beyond that, and are very aware it’s hard work. However, pre-Matrix, the core team professionally built VoIP stacks, so we have some experience here (and our own WebRTC implementation should we need it :)

I would love to see you use Jitsi and contribute back, where possible.

Also, if you move on for reasons that aren't bike-shedding then I'd love to know what the architectural/technical reasoning is.

The main reason to consider something other than Jitsi is to use Matrix directly for decentralised E2EE signalling to manage the media streams, and to allow hybrid SFU and MCU models (like Hangouts or Zoom) rather than pure SFU like Jitsi. We do like Jitsi though, and already contribute directly and indirectly - Jitsi’s E2EE is derived from Matrix, and the Matrix community contributed a tonne of a11y fixes to it that just landed. But we’d still like a fully decentralised Matrix-native group call solution eventually.
Wonderful! I think you've answered any questions I might have already.
If you do AGC, you'll already be more usable than Discord, who apparently refuse to implement this. Whenever we're voice chatting in my group, it's extremely annoying that one person is quiet as a whisper and the other is AT 200% VOLUME AND CLIPPING.
The risk with AGC is that, unless it's combined with good voice activation, you can end up amplifying background noise and then deafening people when they start speaking. But yup, we'd definitely want to do this.
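To make that failure mode concrete, here is a minimal sketch of a frame-based AGC with a crude level gate standing in for voice activation. It is an illustration only, not how WebRTC or Jitsi implement AGC; the function names and every parameter (`target_rms`, `gate_rms`, `max_gain`, `smoothing`) are made up for the example.

```python
import math

def rms(frame):
    """Root-mean-square level of one frame of float samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def agc(frames, target_rms=0.1, gate_rms=0.01, max_gain=10.0,
        smoothing=0.5, use_gate=True):
    """Apply a slowly adapting gain to each frame.

    With use_gate=True the gain is only updated on frames whose level
    reaches gate_rms (a stand-in for real voice activity detection);
    on quieter frames the previous gain is held.
    Returns (processed_frames, gain_per_frame).
    """
    gain = 1.0
    out, gains = [], []
    for frame in frames:
        level = rms(frame)
        if level > 0 and (not use_gate or level >= gate_rms):
            desired = min(target_rms / level, max_gain)
            gain += smoothing * (desired - gain)  # move part-way to the desired gain
        gains.append(gain)
        out.append([s * gain for s in frame])
    return out, gains

# Five frames of faint background noise followed by one frame of speech.
noise = [[0.001, -0.001] * 80 for _ in range(5)]
speech = [[0.2, -0.2] * 80]

ungated_out, ungated_gains = agc(noise + speech, use_gate=False)
gated_out, gated_gains = agc(noise + speech, use_gate=True)
```

Without the gate, the gain ramps up towards `max_gain` during the noise, so the first speech frame comes out beyond full scale (it would clip). With the gate, the gain is held during the noise and the speech frame stays at a sensible level.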
Oh, sorry, you meant input normalization, of course. Please please please also add output normalization, so all people in a room sound at the same volume!
I think it depends on which perspective your inputs/outputs are coming from ;) The way for all people in a room to sound at the same volume is to normalise the audio you capture from them. If you normalise the mixed signal you play back to the other users in the room, it's too late (at least if they speak over the top of each other), plus it'd be too hard as you'd be constantly yoyoing around.
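A rough sketch of why normalising the mix is "too late", assuming simple RMS normalisation and a plain sum as the mixer. The helper names (`normalise`, `mix`, `level_ratio`) and the test signals are hypothetical, chosen for the example.

```python
import math

def rms(samples):
    """Root-mean-square level of a list of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def normalise(samples, target_rms=0.1):
    """Scale a stream so its RMS level equals target_rms."""
    level = rms(samples)
    if level == 0:
        return list(samples)
    gain = target_rms / level
    return [s * gain for s in samples]

def mix(*streams):
    """Mix equal-length streams by summing them sample by sample."""
    return [sum(values) for values in zip(*streams)]

def level_ratio(a, b):
    """How loud stream a is relative to stream b."""
    return rms(a) / rms(b)

# Two speakers talking at once: one very quiet, one very loud.
quiet = [0.02 * math.sin(2 * math.pi * i / 50) for i in range(200)]
loud = [0.5 * math.sin(2 * math.pi * i / 37) for i in range(200)]

# Option A: normalise each captured stream, then mix.
per_speaker_ratio = level_ratio(normalise(quiet), normalise(loud))

# Option B: mix first, then normalise the mix. One gain is applied to the
# whole mix, so each speaker's share of it is scaled by the same factor.
mix_gain = 0.1 / rms(mix(quiet, loud))
mixed_ratio = level_ratio([s * mix_gain for s in quiet],
                          [s * mix_gain for s in loud])
```

In option A both speakers end up at the same level (`per_speaker_ratio` is about 1). In option B the quiet speaker is still a small fraction of the loud one, because a single gain on the mixed signal cannot change the balance between the voices inside it.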
Oh hmm, that's true, thanks for the clarification.