Not much: each participant has a pair of (amplitude, phase) values for each microphone. Filtering human voices and correlating sources to find the phase is not new.
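For anyone who hasn't seen the correlation step, here is a rough sketch of the textbook version: slide one microphone's signal against the other and keep the lag where they line up best. It is not a claim about how any actual product does it. The function name `estimate_delay_and_gain`, the synthetic burst, and the 5-sample / 0.6 values are made up for illustration, and it assumes one clean source with no noise or reverb.

```python
import math

def estimate_delay_and_gain(ref, other, max_lag):
    """Brute-force cross-correlation (hypothetical helper, not from any library).

    Returns (lag, gain): the shift in samples at which `other` best matches
    `ref`, and the least-squares amplitude ratio of `other` to `ref` at that shift.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, a in enumerate(ref):
            j = i + lag
            if 0 <= j < len(other):
                score += a * other[j]
        if score > best_score:
            best_lag, best_score = lag, score
    energy = sum(a * a for a in ref)
    return best_lag, best_score / energy

# Synthetic data (assumed): a decaying tone burst that reaches mic B
# 5 samples later and at 60% of the amplitude seen by mic A.
burst = [math.sin(2 * math.pi * i / 16) * math.exp(-i / 20) for i in range(64)]
mic_a = burst + [0.0] * 16
mic_b = [0.0] * 5 + [0.6 * x for x in burst] + [0.0] * 11

delay, gain = estimate_delay_and_gain(mic_a, mic_b, max_lag=10)
```

On this toy input the lag comes back as the 5-sample offset and the gain as the 0.6 ratio, i.e. one (amplitude, phase) pair for that source at that microphone. Real rooms with overlapping talkers are where it stops being this simple.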
Agreed. It's even easy to do manually in Audacity with recorded tracks. There are probably some ML innovations here (maybe isolating the voice signal from the background in a way that lets you label the phase info well enough to correlate it), but the main innovation I can see is packaging it in a way that's useful in this context.
Filtering sources isn't new, of course, and doing it manually, fiddling around with a recording you already made, is one thing.
But doing it automatically, in real time, on equipment with unknown characteristics, in a way that leaves zero ghost signal behind... is, yes, insanely hard.