by brw12 3148 days ago
I'm constantly surprised by how poorly speech recognition uses context. I think the basic audio recognition does well, but when there is ambiguity, even systems that are popularly considered high-performing seem to degrade drastically. For instance, I'm using Dragon NaturallySpeaking to dictate this right now, but if I say a certain punctuation mark at the end of a sentence, half the time it's going to say excavation mark!

Ditto with Google's Google Now assistant, or whatever the heck it's called these days. I have a Pixel 2 phone (Dragon heard "pixel to phone" -- it doesn't have up-to-date context on proper nouns in the news), but when I tried to create a calendar event using "Create calendar event... meet Bruno for pizza", it heard "MIT pronoun for pizza". It has hundreds of samples of my voice, and it already knew I was creating an event! "Meet" has to be one of the most common first words used in events.
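
To make the "it already knew I was creating an event" point concrete, here is a rough sketch of rescoring a recognizer's candidate transcripts with a context prior. Everything in it is an assumption for illustration: the word counts, the acoustic scores, and the names `EVENT_FIRST_WORD_COUNTS`, `context_log_prior`, and `rescore` are made up, and this is not how Dragon or Google's assistant is known to work.

```python
import math

# Hypothetical counts of first words seen in calendar event titles (made-up numbers).
EVENT_FIRST_WORD_COUNTS = {"meet": 500, "call": 300, "lunch": 150, "dinner": 120}


def context_log_prior(text, counts, smoothing=1.0):
    """Smoothed log probability of the hypothesis's first word in this context."""
    words = text.lower().split()
    first = words[0] if words else ""
    # Reserve one extra smoothing slot for any word not in the table.
    total = sum(counts.values()) + smoothing * (len(counts) + 1)
    return math.log((counts.get(first, 0) + smoothing) / total)


def rescore(nbest, counts, weight=1.0):
    """Pick the (text, acoustic_log_score) pair with the best combined score."""
    return max(nbest, key=lambda h: h[1] + weight * context_log_prior(h[0], counts))


# Hypothetical n-best list: the acoustic model slightly prefers the wrong transcript.
nbest = [
    ("MIT pronoun for pizza", -4.0),
    ("meet Bruno for pizza", -4.5),
]

best = rescore(nbest, EVENT_FIRST_WORD_COUNTS)
print(best[0])
```

With `weight=0` the context is ignored and the acoustically preferred transcript wins; with the prior switched on, the candidate starting with "meet" comes out ahead, because that word is far more likely to open an event title.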

It seems to me like there is some pretty low-hanging fruit here, and that we need more focus on flexibility and resourcefulness rather than acting as though all that's left is moving from 99.5% accuracy to 99.6%.