Sometimes creativity is the only thing holding people back from exploiting the natural insects of the web.
Case in point: ever wonder why those captchas include street addresses or 'pick the shape with a hole in it'? Spoiler: you're building and validating training data.
a) Stuff similar to this has been available for ages, and there are still no (good) open source voice recognition packages.
b) It requires absolute mountains of training data which we don't have.
c) It requires designing a suitable network, which I'm not sure we have, but I doubt it.
d) It requires training a network on the mountains of training data using an immense computing cluster, which requires money that we don't have.
Don't hold your breath.