| I ship speech recognition to users for full computer control (mixed commands and dictation) with a very tight feedback loop, so I get a lot of direct feedback about any common issues. One time I beta tested a new speech model I had trained that scored very well on WER: something like 1/2 to 1/3 as many errors as the previous model. This new model frustrated a lot of users, because the _nature_ of the errors was much worse than before, despite there being fewer errors overall. Its worst characteristic was word deletions, which occurred far more often. This makes me think we should consider reporting insertions, replacements (substitutions) and deletions as separate % metrics (which I found some older whitepapers did!). We also have CER (Character Error Rate), which is more granular and helps give a sense of whether entire words are wrong (CER close to WER) or mostly just single letters (CER much lower than WER). - I'd welcome some ideas for new metrics, even if they only make sense for evaluating my own models against each other. GPT-2 perplexity? A phoneme-aware WER that penalizes errors more if they don't sound "alike" to the ground truth? (Humans can in some cases read a transcription where every word is wrong, i.e. 100% WER, and still figure out from the sound of each incorrect word what the "right" words would have been.) An "edge" error rate, that is, the likelihood that errors occur at the beginning or end of an utterance rather than in the middle? Some kind of word histogram, to show which specific words tend to result in errors and which tend to be recognized well? One of the tasks I've found hardest is predicting single words in isolation. I'd love a good, standard, demographically distributed dataset for this, e.g. 100,000 English words spoken in isolation by speakers with a good accent/dialect distribution. I built a small version of this myself and I've seen WER >50% on it for many publicly available models. More focus on accent/dialect-aware evaluation datasets?
+ From one of my other comments here: some way to detect error clustering? I think ideally you want errors to be randomly distributed rather than clustered on adjacent words or concentrated in specific parts of an utterance (e.g. a tendency to mess up the last word in the utterance). |
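To make a few of these ideas concrete, here is a rough Python sketch of how the insertion/substitution/deletion split, an "edge" count and a simple adjacency-based clustering count could be computed from a plain word-level Levenshtein alignment. Everything here is an assumption for illustration: the function names `align` and `error_breakdown`, the lowercase/whitespace tokenization, and the definitions of "edge" (first or last reference word) and "clustered" (an errored reference word next to another errored reference word) are mine, not an established standard. Insertions have no reference position, so they are left out of the positional counts.

```python
def align(ref, hyp):
    """Word-level Levenshtein alignment.

    Returns one op per alignment step: 'ok', 'sub', 'del' or 'ins'.
    """
    n, m = len(ref), len(hyp)
    # dp[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match / substitution
                           dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1)         # insertion
    # Walk back from the bottom-right corner to recover one optimal alignment.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            ops.append("ok" if ref[i - 1] == hyp[j - 1] else "sub")
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append("del")
            i -= 1
        else:
            ops.append("ins")
            j -= 1
    ops.reverse()
    return ops


def error_breakdown(ref_text, hyp_text):
    """Split WER into sub/del/ins rates and count edge and clustered errors."""
    ref, hyp = ref_text.lower().split(), hyp_text.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    ops = align(ref, hyp)
    n = len(ref)
    counts = {k: ops.count(k) for k in ("sub", "del", "ins")}
    # One flag per reference word: True if it was substituted or deleted.
    flags = [op != "ok" for op in ops if op != "ins"]
    edge_positions = {0, n - 1}
    edge_errors = sum(flags[k] for k in edge_positions)
    clustered_errors = sum(
        1 for k in range(n)
        if flags[k] and ((k > 0 and flags[k - 1]) or (k < n - 1 and flags[k + 1]))
    )
    return {
        "wer": sum(counts.values()) / n,
        "sub_rate": counts["sub"] / n,
        "del_rate": counts["del"] / n,
        "ins_rate": counts["ins"] / n,
        "edge_errors": edge_errors,
        "clustered_errors": clustered_errors,
    }


if __name__ == "__main__":
    print(error_breakdown("turn the volume down please",
                          "turn volume town please now"))
    print(error_breakdown("open the file menu", "open the"))
```

Two models with the same WER can look very different under this kind of breakdown: one with a high `del_rate` and errors piling up at the end of utterances would match the behaviour that frustrated my users, even though its headline number is better. Note that when several optimal alignments exist, the split between substitutions and deletions/insertions depends on the tie-breaking order in the backtrace, so the numbers are only comparable when computed with the same code.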
https://scholar.google.com/citations?view_op=view_citation&h...