


	by blululu 1740 days ago
	It feels like a lot of the counterexamples listed involve contractions and conjugation errors. Saying 'like' and 'liked' are different words is a strict interpretation. Similarly, 'I am' and 'I'm' are not really distinct words, so counting that toward the error rate is a bit too literal. Those objections could be solved by a decent parser. That said, weighting insertions and deletions equally is clearly a problem. Certain words ought to carry more weight in a model. Weighting words by something like 1/log(frequency) might be a good start, since less common words tend to be more important for meaning.
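A minimal sketch of the two ideas above, assuming a tiny hand-written contraction table and made-up corpus counts. The names `CONTRACTIONS`, `normalize`, and `word_weight` are hypothetical, and the count is shifted by 2 before taking the log so that unseen or singleton words do not cause a division by zero (that shift is an assumption, not part of the original suggestion).

```python
import math
import re

# Hypothetical contraction table; a real parser would cover far more cases.
CONTRACTIONS = {"i'm": "i am", "don't": "do not", "it's": "it is"}

def normalize(text):
    """Lowercase, expand known contractions, and split into words."""
    words = re.findall(r"[a-z']+", text.lower())
    expanded = []
    for w in words:
        expanded.extend(CONTRACTIONS.get(w, w).split())
    return expanded

def word_weight(word, counts):
    """Weight of 1/log(frequency); counts are shifted by 2 so the log is never zero."""
    return 1.0 / math.log(counts.get(word, 0) + 2)

# Made-up corpus counts, for illustration only.
counts = {"but": 50000, "i": 80000, "am": 20000, "bob": 40}
```

With these counts, `normalize("I'm Bob")` and `normalize("I am Bob")` produce the same word list, and `word_weight("bob", counts)` comes out larger than `word_weight("but", counts)`.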

2 comments

lunixbochs 1740 days ago

Interesting, maybe instead of my proposed perplexity metric, we measure the difference in both utterance and per-word perplexity between ground truth and output with a strong language model? Ideally it's low - the language model should consider each predicted word to be "about as likely in context" as the closest ground truth words.

In other words, measure LM perplexity on the ground truth words, then on the predicted words, and minimize the difference between the two perplexities. Ideally with a general model like GPT-2 or BERT or something that you aren't using anywhere in your actual ASR.

This may even be more tolerant of errors in the ground truth transcription than raw WER.
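A rough sketch of the utterance-level version of this idea. It assumes a toy add-one-smoothed unigram model in place of a strong LM like GPT-2, so it ignores context entirely, and it does not cover the per-word comparison. `UnigramLM` and `perplexity_gap` are hypothetical names, and the corpus is made up.

```python
import math
from collections import Counter

class UnigramLM:
    """Add-one-smoothed unigram model; a toy stand-in for a strong LM such as GPT-2."""

    def __init__(self, corpus_words):
        self.counts = Counter(corpus_words)
        self.total = len(corpus_words)
        self.vocab = len(self.counts) + 1  # one extra slot for unseen words

    def surprisal(self, word):
        """Negative log probability of a word, in nats."""
        p = (self.counts[word] + 1) / (self.total + self.vocab)
        return -math.log(p)

    def perplexity(self, words):
        """exp of the mean per-word surprisal (words must be non-empty)."""
        return math.exp(sum(self.surprisal(w) for w in words) / len(words))

def perplexity_gap(lm, ref, hyp):
    """Absolute difference in utterance perplexity between ground truth and ASR output."""
    return abs(lm.perplexity(ref) - lm.perplexity(hyp))

# Made-up training text, for illustration only.
corpus = "the cat sat on the mat and the dog sat on the rug".split()
lm = UnigramLM(corpus)
ref = "the cat sat on the mat".split()
good = "the cat sat on the rug".split()    # swaps in an equally likely word
bad = "the cat zebra on the mat".split()   # swaps in an unseen word
```

Here `perplexity_gap(lm, ref, good)` is zero because "mat" and "rug" are equally frequent in the toy corpus, while `perplexity_gap(lm, ref, bad)` is positive. A contextual model would be needed to judge whether a word is likely in its position, which is the point of the proposal.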


dylanbfox 1740 days ago

> since less common words tend to be more important for meaning.

Exactly. Errors with proper nouns are usually more problematic than errors with stop words, yet they're weighted equally in the WER calculation. I.e., deleting "Bob" and deleting "but" both count as a deletion of the same degree according to WER, but we as humans know that deleting "Bob" is potentially a lot more problematic than deleting "but".
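To make the "Bob" versus "but" point concrete, here is a sketch of a weighted word error rate, assuming a standard edit-distance alignment in which each deletion or insertion costs the weight of the word involved. The function `weighted_wer`, the substitution cost (the larger of the two word weights), and the numbers in `WEIGHTS` are all assumptions for illustration.

```python
def weighted_wer(ref, hyp, weight):
    """Weighted edit distance between word lists, divided by the total reference weight."""
    # d[i][j] = minimum cost to turn ref[:i] into hyp[:j]
    d = [[0.0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        d[i][0] = d[i - 1][0] + weight(ref[i - 1])   # delete every reference word
    for j in range(1, len(hyp) + 1):
        d[0][j] = d[0][j - 1] + weight(hyp[j - 1])   # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            r, h = ref[i - 1], hyp[j - 1]
            sub = 0.0 if r == h else max(weight(r), weight(h))
            d[i][j] = min(
                d[i - 1][j] + weight(r),   # delete r
                d[i][j - 1] + weight(h),   # insert h
                d[i - 1][j - 1] + sub,     # match or substitute
            )
    total = sum(weight(w) for w in ref)
    return d[len(ref)][len(hyp)] / total if total else 0.0

# Hypothetical weights: rare words such as names cost more than stop words.
WEIGHTS = {"bob": 5.0, "but": 0.5}

def weight(word):
    return WEIGHTS.get(word, 1.0)

ref = "but bob left early".split()
drop_but = weighted_wer(ref, "bob left early".split(), weight)
drop_bob = weighted_wer(ref, "but left early".split(), weight)
```

With these weights, `drop_bob` is ten times `drop_but`, whereas plain WER scores both hypotheses identically. If every word has weight 1.0, the function reduces to ordinary WER.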


lunixbochs 1740 days ago

You could weight insertions by how much perplexity they add (sum), deletions by how much perplexity they remove (-sum), and replacements by how big the perplexity difference is in the replaced word (abs(sum)). Then report this as a 4-part score (combined mean, then separate i/d/r). Lower is better.

Theory being you don't want to add or remove confusing words, but common stop words are less of an issue.

I'm not sure how this interacts with a multi-word replacement, where the new words together make sense but independently make no sense to the LM.
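One possible reading of the 4-part score, sketched under heavy assumptions: per-word surprisal (negative log probability) from a made-up lookup table stands in for the perplexity contribution a real LM would give, and `difflib.SequenceMatcher` stands in for a proper WER alignment. `PROBS`, `surprisal`, and `idr_scores` are hypothetical names.

```python
import math
from difflib import SequenceMatcher

# Hypothetical per-word probabilities standing in for a real language model.
PROBS = {"the": 0.2, "but": 0.1, "left": 0.01, "bob": 0.001}

def surprisal(word):
    """Negative log probability; unseen words get a small floor probability."""
    return -math.log(PROBS.get(word, 1e-4))

def idr_scores(ref, hyp):
    """Return (combined mean, insertion, deletion, replacement) scores."""
    ins = dele = rep = 0.0
    ops = SequenceMatcher(a=ref, b=hyp, autojunk=False).get_opcodes()
    for tag, i1, i2, j1, j2 in ops:
        removed = sum(surprisal(w) for w in ref[i1:i2])
        added = sum(surprisal(w) for w in hyp[j1:j2])
        if tag == "insert":
            ins += added                  # surprisal the extra words add
        elif tag == "delete":
            dele += removed               # surprisal the missing words took away
        elif tag == "replace":
            rep += abs(added - removed)   # size of the change over the replaced span
    return ((ins + dele + rep) / 3, ins, dele, rep)

ref = "but bob left".split()
_, _, dele_bob, _ = idr_scores(ref, "but left".split())   # "bob" deleted
_, _, dele_but, _ = idr_scores(ref, "bob left".split())   # "but" deleted
```

In this toy setup `dele_bob` is larger than `dele_but`, matching the intuition that dropping a rare word costs more than dropping a stop word. The replace branch sums over the whole replaced span, so a multi-word replacement is compared as a group, but because the table is context-free it cannot tell whether the new words make sense together.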
