Can you really distinguish sound emitted from a speaker from someone with a hoarse voice? Also, what about the medium? Sound traveling through different media should be considered.
Well, the device knows exactly what signal it's putting out through the speaker, so it can predict what the microphone will pick up. It doesn't know what someone with a hoarse voice is about to say.
The attack described in the video isn't a sound the phone produces and then picks up (most phones already ignore what they play back), but rather a sound played by a laptop and picked up by the phone.
And further, the attack described is a sentence that doesn't sound to a human like "Ok Google", "Hey Siri", or whatever the wake phrase is.
I'd guess most speaker-generated audio would be from a compressed source. Audio compression generally cuts off frequencies that we cannot hear. When we speak, though, we must be generating a lot of inaudible frequencies. A device could tell the two apart by checking whether those frequencies are present.
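To make the idea above concrete, here's a rough sketch of what "checking if those frequencies exist" could look like. It's only an illustration: the function name `high_band_ratio`, the 16 kHz cutoff, and the test tones are all my own assumptions, and real microphones, codecs, and voices may not behave this cleanly.

```python
import cmath
import math


def high_band_ratio(samples, sample_rate, cutoff_hz):
    """Return the fraction of spectral energy at or above cutoff_hz.

    Uses a naive DFT, so it is only meant for short frames.
    """
    n = len(samples)
    total = 0.0
    high = 0.0
    for k in range(1, n // 2 + 1):  # skip DC, stop at the Nyquist bin
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(samples)
        )
        energy = abs(coeff) ** 2
        total += energy
        if k * sample_rate / n >= cutoff_hz:
            high += energy
    return high / total if total > 0 else 0.0


def tone(freq_hz, sample_rate, n):
    """Hypothetical helper: n samples of a unit-amplitude sine wave."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


RATE = 48000   # assumed sample rate in Hz
FRAME = 480    # 10 ms frame, so DFT bins fall on multiples of 100 Hz

# Stand-in for "compressed" audio: nothing above the assumed cutoff.
low_only = tone(1000, RATE, FRAME)
# Stand-in for "live" audio: same tone plus a weaker 18 kHz component.
full_band = [a + 0.5 * b for a, b in zip(low_only, tone(18000, RATE, FRAME))]

print(high_band_ratio(low_only, RATE, 16000))
print(high_band_ratio(full_band, RATE, 16000))
```

The first ratio comes out essentially zero and the second around 0.2, so a threshold somewhere in between would separate these two synthetic signals. Whether real speech and real compressed playback separate that neatly is exactly the open question.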
It's definitely possible from a technical perspective. It's very similar to the way echo cancellation works on phones already.
Since the output is known, similar input can then be stripped. This only works when both the output of the speaker and input of the microphone are known.
It can't be used to determine whether another speaker, such as a TV, generated the sound, because the device doesn't know that signal.
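For anyone curious what "known output gets stripped from the input" looks like in practice, here's a minimal sketch using a normalized LMS (NLMS) adaptive filter, which is one common way to do echo cancellation. I'm not claiming any particular phone does it this way; the function names, the tap count, the step size, and the made-up echo path are all assumptions for illustration.

```python
import random


def nlms_echo_cancel(reference, mic, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of `reference` from `mic`.

    reference: samples the device sent to its own speaker (known).
    mic:       samples the microphone recorded.
    Returns (residual, weights), where residual is mic minus the echo estimate.
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # recent reference samples, newest first
    residual = []
    for x, d in zip(reference, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate
        norm = sum(h * h for h in history) + eps
        # NLMS update: step toward the weights that explain the mic signal.
        weights = [w + mu * error * h / norm for w, h in zip(weights, history)]
        residual.append(error)
    return residual, weights


def simulate_echo(reference):
    """Hypothetical room: two delayed, attenuated copies of the speaker output."""
    out = []
    for n in range(len(reference)):
        a = reference[n - 2] if n >= 2 else 0.0
        b = reference[n - 5] if n >= 5 else 0.0
        out.append(0.6 * a + 0.3 * b)
    return out


random.seed(0)
speaker_out = [random.uniform(-1.0, 1.0) for _ in range(2000)]
mic_in = simulate_echo(speaker_out)  # the mic hears only the device's own echo

residual, weights = nlms_echo_cancel(speaker_out, mic_in)
tail_energy = sum(e * e for e in residual[-500:])
print(tail_energy)
print([round(w, 3) for w in weights])
```

The filter only has `speaker_out` to work with, which is the point made above: sound from a TV or a laptop never appears in the reference signal, so there's nothing for the filter to lock onto and it passes through in the residual.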
Wouldn't that be similar to the tech that removes echoing from voice/video chatting when someone isn't using headphones? I wonder how intense that kind of processing is.