Pretty cool. But I'll eat my hat if a regular webcam can pick up enough detail in regular lighting to do this. Plus, 90% of the time the webcam will capture a user's chest and the wall behind them, which is hardly useful for a visual microphone.
The saying is "attacks only get better". It's likely that more can be done with fewer pixels but more software.
But hardware can also get better. Surely the next generation of laptops will have depth-sensing cameras too? Depth sensing is becoming an integral part of game console motion detection, and normal smartphones will have it too, e.g. some hype I found by googling: https://3dprint.com/117809/depth-sensing-phone-cameras/
This was an interesting thing to see when it came out, and it keeps people aware of what is possible. Maybe even more is possible using this technique.
Nevertheless, someone activating one of the many microphones around us (mobile phones, landline phones, laptops, "Echo"-like devices, speech-controlled televisions) would concern me much more.
The fundamental frequency of the human voice ranges from about 85 Hz to 255 Hz, which means a webcam would need to record at about 500 Hz to capture enough information to reconstruct voice with good quality.
Because webcams record at 60 Hz (max), they can only capture enough data to reconstruct sound up to 30 Hz, way below the human voice range.
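To make the arithmetic in that argument explicit, here is a minimal Python sketch of the Nyquist reasoning. The function names are made up for illustration, and it assumes the camera yields exactly one sample per frame, which is the premise of this comment.

```python
def required_sample_rate(max_freq_hz: float) -> float:
    """Nyquist rate: the sampling rate needed to represent max_freq_hz."""
    return 2.0 * max_freq_hz


def highest_recoverable_freq(sample_rate_hz: float) -> float:
    """Nyquist limit: the highest frequency a given sampling rate can represent."""
    return sample_rate_hz / 2.0


# Figures taken from the comment above.
VOICE_MAX_HZ = 255.0   # upper end of the quoted voice range
WEBCAM_FPS = 60.0      # assumed: one sample per frame

needed_hz = required_sample_rate(VOICE_MAX_HZ)    # 510.0, i.e. "about 500 Hz"
limit_hz = highest_recoverable_freq(WEBCAM_FPS)   # 30.0

print(f"needed: {needed_hz} Hz, per-frame limit: {limit_hz} Hz")
```

Under the one-sample-per-frame assumption, the 30 Hz limit sits well below the 85 Hz bottom of the quoted range.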
Webcams effectively sample faster than 60 Hz. Sound perturbs the recorded image on every scanline, not just every frame. The techniques that reconstruct audio from video work not frame by frame but line by line. 60 Hz times 720 lines of vertical resolution is 43,200 Hz, which is way more than enough data.
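The line-by-line figure can be checked with a few lines of Python. This is a rough sketch with made-up function names, and it assumes an idealized sensor that reads its rows one after another, evenly spaced across the whole frame period. A real sensor may spend part of each frame period not reading rows, so treat the result as an upper bound rather than a measured rate.

```python
def line_sample_rate(fps: float, rows: int) -> float:
    """Samples per second if each scanline counts as one sample.

    Assumption: rows are read sequentially and evenly over the full
    frame period, with no idle time between frames.
    """
    return fps * rows


def nyquist_limit(sample_rate_hz: float) -> float:
    """Highest frequency representable at the given sampling rate."""
    return sample_rate_hz / 2.0


# Figures from the comment above: 60 frames per second, 720 rows.
rate_hz = line_sample_rate(60, 720)   # 43200
limit_hz = nyquist_limit(rate_hz)     # 21600.0

print(f"line rate: {rate_hz} Hz, Nyquist limit: {limit_hz} Hz")
```

Even with a large share of that rate lost to gaps between frames, the limit would stay far above the 255 Hz quoted earlier in the thread.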