I've been developing Peer Calls since 2016. It's an open source WebRTC group conferencing solution that doesn't require a user account and has both a full-mesh P2P mode and an SFU mode (streaming through a central server). It works on Android, iOS, Firefox, Chrome, Safari and the latest Edge, and there is no app, only a link that you need to share: https://peercalls.com/
For the audio side, Mumble is okay. It doesn't do feedback cancellation as well as Zoom, and it uses keys for auth, which is really confusing for users (and the UI isn't great at letting you switch which one you're using).
But I got a Mumble server up and running in no time flat, so it works that way, and the audio quality is amazing - far better than Zoom - as long as everybody's wearing properly-configured headsets.
Everyone starts with P2P and realizes it doesn't work. OG Skype predates fast internet and expectations of extremely smooth conferencing between an arbitrary number of users. A centralised server is basically a requirement for acceptable performance.
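As a rough illustration of why full mesh falls over as calls grow, here is a back-of-the-envelope sketch in Python. It assumes every participant publishes exactly one stream and ignores simulcast, muted participants and bandwidth adaptation; the function names are made up for this example.

```python
def mesh_load(n: int) -> dict:
    """Stream counts for a full-mesh call with n participants."""
    return {
        # Each client sends its stream separately to every other peer.
        "uploads_per_client": n - 1,
        "downloads_per_client": n - 1,
        # One connection per unordered pair of peers.
        "total_connections": n * (n - 1) // 2,
    }


def sfu_load(n: int) -> dict:
    """Stream counts when a central server forwards every stream."""
    return {
        # Each client uploads once; the server fans the stream out.
        "uploads_per_client": 1,
        "downloads_per_client": n - 1,
        "server_inbound": n,
        "server_outbound": n * (n - 1),
    }


if __name__ == "__main__":
    for n in (2, 4, 8, 16):
        mesh, sfu = mesh_load(n), sfu_load(n)
        print(
            f"n={n:2d}  mesh uploads/client={mesh['uploads_per_client']:2d}  "
            f"sfu uploads/client={sfu['uploads_per_client']}  "
            f"sfu server outbound={sfu['server_outbound']}"
        )
```

The point is that in a mesh the per-client upload count grows with the size of the call, and home upload bandwidth is usually the scarce resource. With a central server the client uploads once and the fan-out cost moves to the server.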
Federation isn't needed because you don't need to have servers that communicate with each other, you just need some server that can host a given video call. So any open source solution works, no federation architecture is required.
The "how" would depend on how you implement it. The servers could determine at the beginning which one gets to host the call and tell all participating clients to connect there, or there could be some client-server-server-client model akin to e.g. IRC, where you would save some server bandwidth in exchange for worse latency (more hops). I'm just spitballing here, mind you.
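To make the first option a bit more concrete, here is one way the "decide up front which server hosts the call" step could look. This is only a sketch under assumptions of my own: each candidate server somehow knows a round-trip time to each participant, and the rule is "pick the server with the best worst-case latency". The server names, the `rtt_ms` table and the `pick_host` function are all hypothetical, not part of any real federation protocol.

```python
def pick_host(rtt_ms: dict[str, dict[str, float]]) -> str:
    """Pick the server whose slowest participant link is fastest.

    rtt_ms maps server name -> {participant name -> round-trip time in ms}.
    Ties are broken by server name, so every server that runs this on the
    same table arrives at the same answer.
    """
    if not rtt_ms:
        raise ValueError("no candidate servers")
    return min(rtt_ms, key=lambda server: (max(rtt_ms[server].values()), server))


if __name__ == "__main__":
    # Hypothetical measurements for a two-person call across two servers.
    measurements = {
        "a.example": {"alice": 20.0, "bob": 180.0},
        "b.example": {"alice": 90.0, "bob": 95.0},
    }
    host = pick_host(measurements)
    # Every client would then be told to connect to this one server.
    print(f"host the call on {host}")
```

Once a host is chosen, the other servers drop out of the media path entirely, so you avoid the extra hops of the server-to-server relay model.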
My question is not "how does federation work?" It is "in what situation is federation useful for video conferencing in a way that a single open source server is not?"