by mstoehr 5885 days ago
Actually, most research effort in speech goes into the language side rather than the signal processing of the speech signal. So I think many people share your intuition.

Bear in mind, though, that humans significantly outperform machines on tasks involving isolated nonsense syllables or streams of them: e.g. when "badagaka" is said, humans can pick out the syllables, whereas computers can have a lot of difficulty (in noise in particular).

Computers come closest to human performance when there is a lot of linguistic context to an utterance. So it appears that humans are doing something other than using semantics.

Good points, but I think we underestimate how much situational context humans use when they interpret language. Sometimes we can communicate with very little language simply because we know what the purpose of the interaction is.

Another thing I keep wondering about is why so little emphasis is put on dialog. When humans don't understand something, they ask, or offer an interpretation and ask whether it's the right one.

Speech recognition systems don't seem to do that. They say "Sorry, I could not understand what you said. Please repeat". That's not very helpful for the computer of course. It should say: "Huh, Peas? Why would anyone rest in peas for heaven's sake??". Then the human could sharpen his SS and say "PeaCCCEE!!! not peas. I'm not talking about food, I'm talking about dying!".

Context is huge for human interpretation. If you've ever had someone address you in a different language than you were expecting, you know what I mean. It's almost like you can imagine the search just going deeper and deeper without finding anything that makes sense until it swaps in the other language and goes: Ah, you said "good morning"! :-)
Especially embarrassing when somebody addresses you in your native language, and you expected something different.
It is true that humans use situational context. In cases where semantics is important and complex for understanding an utterance, a computer will fail even more, because it gets neither the semantics nor the speech signal.

On the topic of dialog, this is arguably the area in which speech recognition has gained over the last nine years. Prior to 2001 there were not many usable dialog systems, and now (depending on your definition of "usable") there are many usable dialog systems deployed in call centers around the world.

Most call center dialog systems have a rudimentary mechanism that asks people to repeat things when the system doesn't understand. However, if it asks more than once, callers tend to get very angry.

Nobody would use a system that interrogated you on every fifth word. That would actually be a step worse than silent failure on every fifth word.
It shouldn't interrupt you once every 5 words of course. What it should try to do is to create a model of what you meant to say. At some point, if the system is unsure, it should ask you to confirm or correct what it has understood so far.
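
To make that concrete, here is a rough sketch of the kind of policy I mean. It assumes the recognizer hands back a per-word confidence score between 0 and 1, which is an assumption on my part; the names `Word` and `next_action` and the 0.6 threshold are made up for illustration and not taken from any real system. It accepts the utterance when every word is reasonably certain, and otherwise asks one targeted question about the least certain word instead of a blanket "please repeat".

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    confidence: float  # hypothetical recognizer score in [0, 1]


def next_action(words, confirm_below=0.6):
    """Return (action, message) for one recognized utterance.

    At most one question is asked per utterance, aimed at the
    single least certain word, so the user is never interrupted
    every few words.
    """
    if not words:
        # Nothing was recognized at all, so a repeat is the only option.
        return ("repeat", "Sorry, I did not catch that. Could you repeat it?")

    heard = " ".join(w.text for w in words)
    weakest = min(words, key=lambda w: w.confidence)

    if weakest.confidence >= confirm_below:
        # Every word is above the threshold: accept silently.
        return ("accept", heard)

    # Offer the current interpretation and ask about the doubtful word.
    return ("confirm", f'I heard "{heard}". Did you really say "{weakest.text}"?')


utterance = [Word("rest", 0.9), Word("in", 0.95), Word("peas", 0.4)]
action, message = next_action(utterance)
print(action, "->", message)
```

Using the minimum word confidence is only one possible trigger. A real system would presumably also weigh how much the doubtful word matters to the overall meaning.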