The non-AI version is better: using multiple microphones, you can use the differences in when a sound arrives at each microphone to isolate multiple voices in a crowd and separate them into distinct, clear audio tracks.
This does make use of multiple microphones and the timing differences between sound waves arriving at each one. To my understanding, the machine learning part lets it keep working even as the speaker and the wearer move relative to each other, such as when you turn your head.
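
For anyone curious what the classical, non-ML timing trick looks like, here is a minimal sketch of it with two microphones: estimate the arrival-time offset by cross-correlation, then shift and average the channels (delay-and-sum). This is not the method of the device being discussed, just an illustration under simplifying assumptions (one source, no noise, no reverberation, a fixed delay). The function names and the 7-sample delay are made up for the example.

```python
import numpy as np

def estimate_delay_samples(mic_a, mic_b):
    """Estimate how many samples mic_b lags behind mic_a.

    A positive result means the sound reached mic_b later than mic_a.
    """
    # Full cross-correlation; output index i corresponds to lag i - (len(mic_a) - 1).
    corr = np.correlate(mic_b, mic_a, mode="full")
    return int(np.argmax(corr)) - (len(mic_a) - 1)

def delay_and_sum(mic_a, mic_b, delay):
    """Align mic_b to mic_a by the given delay and average the two channels.

    Sound from the targeted direction adds up coherently; sound arriving
    with a different delay does not. np.roll wraps around at the edges,
    which is acceptable for a toy example but not for real audio.
    """
    aligned_b = np.roll(mic_b, -delay)
    return 0.5 * (mic_a + aligned_b)

# Toy setup (all values are assumptions): one noise-like source that
# reaches the second microphone 7 samples after the first.
rng = np.random.default_rng(0)
source = rng.standard_normal(1000)
true_delay = 7
mic_a = source.copy()
mic_b = np.roll(source, true_delay)

estimated_delay = estimate_delay_samples(mic_a, mic_b)
enhanced = delay_and_sum(mic_a, mic_b, estimated_delay)
```

The catch is that `estimated_delay` is only valid while the geometry stays put. Turn your head and the delay changes, so a fixed version like this has to keep re-estimating it, which is presumably the part the learned model helps with.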