by cypress66 1034 days ago
I am skeptical of many of those. Speech recognition is not even close to human level. Whisper, and whatever Google uses, will make a lot of mistakes on audio files that are trivial for any native speaker.

In actual tests it is beyond human level. Humans actually mishear about 1 in 20 words during transcription tests; Whisper does better.
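For reference, "1 in 20 words" corresponds to a word error rate (WER) of 5%. WER is the word-level edit distance (substitutions, deletions, insertions) between a transcript and a reference, divided by the number of reference words. A minimal sketch follows; the sentences are made up for illustration and are not from any benchmark.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative 20-word reference with one misheard word ("sleeps" -> "sweeps").
reference = ("the quick brown fox jumps over the lazy dog while "
             "the small cat sleeps under the warm sun all day")
hypothesis = ("the quick brown fox jumps over the lazy dog while "
              "the small cat sweeps under the warm sun all day")

print(f"WER = {word_error_rate(reference, hypothesis):.2%}")  # WER = 5.00%
```

Note that WER counts every wrong word equally, so it says nothing about whether an error is a harmless near-miss or one that changes the meaning.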
But we don’t solely rely on how well we hear since we have knowledge that allows us to correct for poor hearing based on what is being said rather than forging ahead with a nonsense transcription. Machine transcription is definitely faster and cheaper but the end product isn’t “better,” and anyone who has read it can attest to that.
> But we don’t solely rely on how well we hear since we have knowledge that allows us to correct for poor hearing based on what is being said rather than forging ahead with a nonsense transcription.

Good voice transcription AIs already do that too; that's why they work best if they know which language they're operating in, as that means they can use the language to create a model of the most likely words.
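A toy sketch of that idea: take several candidate transcriptions that sound alike, then let a language model break the tie. Everything here is an assumption for illustration (the tiny corpus, the acoustic scores, the add-one-smoothed bigram model); real engines use far larger models and this is not how Whisper or any specific product works internally.

```python
import math
from collections import Counter

# Hypothetical training text standing in for "knowledge of the language".
CORPUS = [
    "it is hard to recognize speech",
    "machines recognize speech quickly",
    "we walked on a nice beach",
]


def build_counts(sentences):
    """Count unigrams and bigrams, with <s> marking the sentence start."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def lm_log_prob(text, unigrams, bigrams):
    """Log-probability of text under an add-one-smoothed bigram model."""
    words = ["<s>"] + text.split()
    vocab_size = len(unigrams) + 1  # +1 leaves room for unseen words
    total = 0.0
    for prev, word in zip(words, words[1:]):
        total += math.log((bigrams[(prev, word)] + 1) /
                          (unigrams[prev] + vocab_size))
    return total


def pick_best(candidates, unigrams, bigrams, lm_weight=1.0):
    """candidates: list of (text, acoustic_log_score) pairs."""
    def combined(candidate):
        text, acoustic = candidate
        return acoustic + lm_weight * lm_log_prob(text, unigrams, bigrams)
    return max(candidates, key=combined)[0]


unigrams, bigrams = build_counts(CORPUS)
# Made-up acoustic scores: the wrong phrase "sounds" slightly better.
candidates = [("wreck a nice beach", -4.0), ("recognize speech", -4.2)]
print(pick_best(candidates, unigrams, bigrams))  # recognize speech
```

With `lm_weight=0.0` the acoustically favored "wreck a nice beach" wins instead, which is the "forging ahead with a nonsense transcription" case.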

I think the most recent WWDC from Apple even has a video about adding custom vocabulary for their speech engine to pick up on, which covered some details on this exact topic, though I can't search right now.

Undoubtedly so but I have yet to see one that doesn't make mistakes a human would be unlikely to. It is not an easy capability to reproduce and wouldn't have been my first choice if I wanted to talk about things it can do better than people.
> Undoubtedly so but I have yet to see one that doesn't make mistakes a human would be unlikely to.

Absolutely, AI is very rarely human in its failure modes, and often has novel and exciting failure modes instead.

But, on average… or so the marketing claims… it makes fewer mistakes.

For a while, it was possible to improve upon super-human chess AIs by pairing them with a human; the combination was called a centaur. Eventually the AIs got too good even for that, as they stopped making the sorts of mistakes humans could spot, but in the meantime, even though they were superhuman, they had failure modes that we could help out with.

Assuming the intended audience is also humans, "exciting" errors seem worse in this instance, so I find it hard to credit these marketing claims.
Well, those "actual tests" clearly don't reflect reality. This is obvious if you actually use Whisper.