Hacker News new | ask | show | jobs
by mstoehr 5837 days ago
Most commenters are focusing on relatively high level features of decoding speech. It is important to also be aware that there is still great debate about what are the acoustic correlates of linguistic events in speech. It seems that our words are composed of subunits (usually taken to be phones out of the IPA--but there's work on alternatives) but what exact acoustics correspond to the phones are is still unsettled: lots of debate and mediocre recognition performance

Undoubtedly there is much room for improvement on these higher-level features but computers are still well behind humans in large vocabulary isolated keyword spotting: this is a task where one word from a very large corpus of words is spoken and the human or computer has to guess what that word was. Computers do poorly relative to humans (particularly in noise), which suggests that many of the mistakes that computers make is in not being able to interpret the acoustics correctly.

1 comments

And to make this problem worse, there's a great deal of variation among different dialects of the same language, which means that there's no 1:1 mapping of IPA phones to written letters.