by anatari 4143 days ago
Voice recognition is not in an uncanny valley. Uncanny valley means there is a point where something that is less real is better than something that is more real. Pixar improves the scene by adding elements that are unrealistic. Another example is a preference for lower frame rate movies.

Right now, every incremental improvement to voice recognition improves its usefulness. It might appear that we're in an uncanny valley because voice recognition is barely usable right now versus completely unusable in the past, but there is no one who prefers worse voice recognition over better voice recognition.

6 comments

Of course nobody prefers less accurate transcription, but we're talking about more than transcription here. It's entirely reasonable to prefer a "command line" style interface like the XBox over a "conversational" style interface like Siri for the reason that the "conversational" style interface is less reliable even though it's in some sense more "real".
> The uncanny valley is a hypothesis in the field of aesthetics which holds that when features look and move almost, but not exactly, like natural beings, it causes a response of revulsion among some observers. The "valley" refers to the dip in a graph of the comfort level of beings as subjects move toward a healthy, natural likeness described in a function of a subject's aesthetic acceptability.

Wording by Wikipedia. So apparently, there might be points on the graph where a slight increase in recognition performance actually freaks out some users.

I think there is a real danger that people are modifying/learning how to speak to computer voice recognition software. If voice recognition can't quickly become able to parse natural language, it will inevitably have to parse "I'm talking to a dumb computer" cadence and inflection instead.

Incremental improvements are very bad in this regard.

These things are hard to reverse, too (people still speak with a very distinct "I'm speaking on a telephone" cadence today).

IMHO, people (and hence language) will always adapt in certain ways to get the message across. People already learned how to "google" and expect the same style of search queries to be effective elsewhere. When speaking on the telephone, people tend to slightly change their voice to counteract the channel noise (with acoustic consequences such as increased fundamental frequency ["pitch"], etc.). I would be surprised if a similar adaptation didn't happen for human-computer voice interaction, which would ultimately help make it work well enough to be useful. (Of course, using speech recognition to transcribe human-to-human interaction will still be barely usable.)
In text adventures there was a limited grammar that the parser understood that was English-like, which you had to express your actions in: get lamp, put lamp on table, eat fish, look sword. This is easier than if the parser tried to let you use fancy English sentences that it'd probably get wrong often and that you would make more mistakes in because the line is blurred between what you're allowed to say or not. That would be a point between the simplistic text adventure grammar and perfect natural language parsing, with the simpler grammar being preferable. I can see the same being true for voice commands.
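To make the limited-grammar idea concrete, here is a minimal sketch in Python of a verb-noun parser of the kind described above. The word lists, the `Command` type and the `parse` function are all made up for illustration; they are not taken from any real interpreter, and a real game would have a much larger vocabulary.

```python
from typing import NamedTuple, Optional

# Hypothetical vocabulary, taken from the example commands in the comment above.
VERBS = {"get", "put", "eat", "look"}
NOUNS = {"lamp", "table", "fish", "sword"}
PREPOSITIONS = {"on", "in"}


class Command(NamedTuple):
    verb: str
    obj: Optional[str] = None
    prep: Optional[str] = None
    target: Optional[str] = None


def parse(line: str) -> Optional[Command]:
    """Accept only: VERB | VERB NOUN | VERB NOUN PREP NOUN. Reject everything else."""
    words = line.lower().split()
    if not words or words[0] not in VERBS:
        return None
    verb, rest = words[0], words[1:]
    if not rest:
        return Command(verb)
    if len(rest) == 1 and rest[0] in NOUNS:
        return Command(verb, rest[0])
    if (
        len(rest) == 3
        and rest[0] in NOUNS
        and rest[1] in PREPOSITIONS
        and rest[2] in NOUNS
    ):
        return Command(verb, rest[0], rest[1], rest[2])
    # Anything outside the grammar is refused outright rather than guessed at.
    return None
```

With this sketch, `parse("put lamp on table")` yields a full command, while `parse("please pick up the lamp")` returns `None`. The hard boundary is the point: the user always knows whether an input was understood, which is the property a half-working natural language parser loses.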
As a side note, it might be funny to hook one of these voice command systems up to a z-machine VM. The command set is very limited as you pointed out, so it should easily be able to handle the input side. And the voices, while still robotic, seem pretty good as well. With games like Zork being fully text based you could easily turn it into a conversational game.
> but there is no one that prefers worse voice recognition over better voice recognition.

Siri versus XBox in his post? Maybe you're using different metrics for what's better or worse.

I think the OP referred to the uncanny valley for the graph. Think of that graph as the utility derived from voice recognition.