by lunixbochs 1743 days ago
I ship speech recognition to users for full computer control (mixed commands and dictation) with a very tight feedback loop. I get a lot of direct feedback about any common issues.

One time I beta tested a new speech model I trained that scored very well on WER. Something like 1/2 to 1/3 as many errors as the previous model.

This new model frustrated so many users, because the _nature_ of errors was much worse than before, despite fewer overall errors. The worst characteristic of this new model was word deletions. They occurred far more often. This makes me think we should consider reporting insertion/replacement/deletion as separate % metrics (which I found some older whitepapers did!)
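
A minimal sketch of that breakdown, assuming whitespace-tokenized text and a plain Levenshtein alignment (`edit_counts` and `error_rates` are illustrative names, not from any toolkit):

```python
def edit_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) for one minimal alignment."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = (total, subs, dels, ins) for aligning r[:i] with h[:j]
    dp = [[None] * (len(h) + 1) for _ in range(len(r) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(r) + 1):
        dp[i][0] = (i, 0, i, 0)          # only deletions
    for j in range(1, len(h) + 1):
        dp[0][j] = (j, 0, 0, j)          # only insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            t, s, d, n = dp[i - 1][j - 1]
            diag = (t, s, d, n) if r[i - 1] == h[j - 1] else (t + 1, s + 1, d, n)
            t, s, d, n = dp[i - 1][j]
            dele = (t + 1, s, d + 1, n)
            t, s, d, n = dp[i][j - 1]
            ins = (t + 1, s, d, n + 1)
            dp[i][j] = min(diag, dele, ins, key=lambda c: c[0])
    _, subs, dels, inss = dp[-1][-1]
    return subs, dels, inss

def error_rates(ref, hyp):
    """Per-type error rates, each relative to the reference length (assumed non-empty)."""
    subs, dels, inss = edit_counts(ref, hyp)
    n = len(ref.split())
    return {"sub": subs / n, "del": dels / n, "ins": inss / n}
```

Two models with the same WER can then be told apart by comparing their `del` rates directly. When several minimal alignments exist, the split between types depends on tie-breaking, so treat the numbers as approximate.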

We have CER (Character Error Rate), which is more granular and helps give a sense of whether entire words are wrong (CER = WER) or mostly just single letters (CER much lower than WER).
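
To make the CER vs WER comparison concrete, here is a rough sketch that runs the same edit distance over words and over characters (function names are my own, and it assumes a non-empty reference):

```python
def levenshtein(a, b):
    """Edit distance between two sequences (lists of words or strings of characters)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # match or substitution
        prev = cur
    return prev[-1]

def wer(ref, hyp):
    words = ref.split()
    return levenshtein(words, hyp.split()) / len(words)

def cer(ref, hyp):
    return levenshtein(ref, hyp) / len(ref)
```

For the reference "i liked it" and the hypothesis "i like it", WER is 1/3 but CER is only 1/10, which is the "mostly just single letters" case.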

-

I'd welcome some ideas for new metrics, even if they only make sense for evaluating my own models against each other.

GPT2 perplexity?

Phoneme aware WER that penalizes errors more if they don't sound "alike" to the ground truth? (Because humans can in some cases read a transcription where every word is wrong, 100% WER, and still figure out by the sound of each incorrect word what the "right" words would have been)

"edge" error rate, that is, the likelihood that errors occur at the beginning / end of an utterance rather than the middle?

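One possible way to compute an "edge" error rate, as a sketch only: align words with Python's `difflib` (which is not guaranteed to give a minimal edit-distance alignment), then compare the error rate of the first/last `k` reference words with the rate for the words in between. The helper names and the choice to charge insertions to the nearest reference word are my assumptions.

```python
from difflib import SequenceMatcher

def error_positions(ref_words, hyp_words):
    """Indices of reference words touched by an error, per difflib's alignment."""
    bad = set()
    matcher = SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            bad.update(range(i1, i2))
        elif tag == "insert" and ref_words:
            # Assumption: charge an inserted word to the nearest reference position.
            bad.add(min(i1, len(ref_words) - 1))
    return bad

def edge_vs_middle(pairs, k=1):
    """Return (edge error rate, middle error rate) over (reference, hypothesis) pairs."""
    edge_err = edge_tot = mid_err = mid_tot = 0
    for ref, hyp in pairs:
        words = ref.split()
        bad = error_positions(words, hyp.split())
        for i in range(len(words)):
            if i < k or i >= len(words) - k:
                edge_tot += 1
                edge_err += i in bad
            else:
                mid_tot += 1
                mid_err += i in bad
    return edge_err / max(edge_tot, 1), mid_err / max(mid_tot, 1)
```

If the first number is consistently much higher than the second, the model is disproportionately dropping or mangling words at utterance boundaries.
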
Some kind of word histogram, to demonstrate which specific words tend to result in errors / which words tend to be recognized well? One of the tasks I've found hardest is predicting single words in isolation. I'd love a good/standard (demographically distributed) dataset around this, e.g. 100,000 English words spoken in isolation by speakers with good accent/dialect distribution. I built a small version of this myself and I've seen WER >50% on it for many publicly available models.
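
For the isolated-word case, the histogram could be as simple as the sketch below. It assumes each trial is a (spoken word, recognized text) pair and counts a trial as correct only on an exact case-insensitive match; `word_error_table` is a made-up name.

```python
from collections import Counter

def word_error_table(trials):
    """Rows of (word, error rate, attempts), worst-recognized words first."""
    seen, wrong = Counter(), Counter()
    for spoken, recognized in trials:
        word = spoken.strip().lower()
        seen[word] += 1
        if recognized.strip().lower() != word:
            wrong[word] += 1
    rows = [(w, wrong[w] / seen[w], seen[w]) for w in seen]
    # Sort by descending error rate, then alphabetically for stable output.
    return sorted(rows, key=lambda row: (-row[1], row[0]))
```

Keeping the attempt count next to each rate makes it easier to ignore words that only look bad because they were tested once.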

More focus on accent/dialect aware evaluation datasets?

+ From one of my other comments here: some ways to detect error clustering? I think ideally you want errors to be randomly distributed rather than clustered on adjacent words or focused on specific parts of an utterance (e.g. tend to mess up the last word in the utterance)
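
A crude clustering check, sketched under the assumption that you already have a 0/1 error flag per reference word for each utterance: count how often two adjacent words are both wrong, and divide by how often that would happen if errors were independent at the overall error rate. The name `clustering_ratio` and the independence baseline are my own choices, not an established metric.

```python
def clustering_ratio(flag_lists):
    """Observed adjacent error pairs divided by the count expected if errors were independent.

    flag_lists: one list of 0/1 (or bool) error flags per utterance.
    Returns None when there are no adjacent pairs or no errors at all.
    """
    errors = total = both = pairs = 0
    for flags in flag_lists:
        errors += sum(bool(f) for f in flags)
        total += len(flags)
        for a, b in zip(flags, flags[1:]):
            pairs += 1
            both += bool(a and b)
    if pairs == 0 or errors == 0:
        return None
    p = errors / total
    expected = p * p * pairs
    return both / expected
```

A ratio near 1 suggests errors are spread out roughly at random, while a ratio well above 1 suggests they bunch together on neighboring words.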


At Amazon I set up an evaluation approach based on whether the system completed the desired task (in that context it was "did the search using the speech recognition output return the same set of items to buy as the search using the transcript?").

https://scholar.google.com/citations?view_op=view_citation&h...

Interesting. It seems like in the "real world" WER is not really the metric that matters, it's more about "is this ASR system performing well to solve my use case" - which is better measured through task-specific metrics like the one you outlined in your paper.
A pure ASR analog of this is how long a continuous utterance it enables. When I use tools like the one lunixbochs builds (including his own), the challenge as a user is trading off doing little bits at a time (slow, but easier to go back and correct) vs saying a whole ‘sentence’ in one go (fast and natural but you’re probably going to have to go back and edit/try again).

Sentence/command error rate (rate of 100% correct sentences/commands that don’t need any editing or re-attempting) is a decent proxy for this. It’s no silver bullet, but it more directly measures how frustrated your users will be.
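
As a sketch, sentence error rate is just the fraction of utterances that are not an exact match. The normalization here (lowercasing and collapsing whitespace) is an assumption; a real evaluation would pick whatever normalization matches what users actually have to fix.

```python
def sentence_error_rate(pairs):
    """Fraction of (reference, hypothesis) pairs that differ after light normalization."""
    def norm(text):
        return " ".join(text.lower().split())
    wrong = sum(norm(ref) != norm(hyp) for ref, hyp in pairs)
    return wrong / len(pairs)  # assumes at least one pair
```

The rate of fully correct sentences mentioned above is `1 - sentence_error_rate(pairs)`.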

If you really wanted to take care of the issues in the article, you could interview a bunch of users and find what percent of them would go back and edit each kind of mistake (if 70% would have to go back and change ‘liked’ to ‘like’, then it’s 70% as bad as substituting ‘pound’ for ‘around’, which presumably every user will go back and edit).
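
The arithmetic for that weighting might look like the sketch below. The `fix_prob` table is hypothetical survey data, and defaulting unknown mistakes to 1.0 (every user fixes them) is my assumption.

```python
def expected_corrections(error_pairs, fix_prob, default=1.0):
    """Sum of the probability that a user goes back to fix each (reference, hypothesis) error."""
    return sum(fix_prob.get(pair, default) for pair in error_pairs)

# Hypothetical survey result: 70% of users would bother fixing 'like' back to 'liked'.
fix_prob = {("liked", "like"): 0.7}
errors = [("liked", "like"), ("around", "pound")]
cost = expected_corrections(errors, fix_prob)  # 0.7 + 1.0
```

Dividing that cost by the number of reference words would give a WER-like number that tracks the editing work users actually do.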

The infuriating thing as a user is when metrics don’t map to the extra work I have to do.

> vs saying a whole ‘sentence’ in one go (fast and natural but you’re probably going to have to go back and edit/try again)

"probably going to have to go back and edit" is generally not the case with my Conformer model, which allows fast paced usage like this with practice: https://twitter.com/lunixbochs/status/1378159234861264896

Unfortunately that was the model I had in mind when I wrote that. I used it for maybe a month (I'm pretty sure), and my experience just wasn't as good as yours. It may be better than what preceded it, but it still drove me crazy. I came away with the conclusion that ASR as a technology just isn't there yet.

(and the conclusion that I need to prevent the return of RSI at all costs from now on. Don't get me wrong, I'm very thankful that talon does as well as it does. It was a job saver.)

Are you referring to the test you mentioned in this thread? https://news.ycombinator.com/item?id=26784732

If so, December predates Conformer, so you're talking about the sconv model, which is the model I was complaining about upthread - it was very polarizing with users, and despite the theoretical WER improvements, the errors were much more catastrophic than the model that preceded it.

In either case, I'm constantly making improvements - I'm in the middle of a retrain that fixes some of the biggest issues (such as misrecognizing some short commands as numbers), and I've done a lot of other work recently that has really polished up the experience with the existing model.

Your story reminds me of what's happened to Google's voice recognition over the last five years or so. It used to mis-hear words, but now it actively alters grammar and inserts words that sound nothing like what I actually said. Just try getting it to type the word "o'clock".
At face value, actively altering grammar to words that you didn't say sounds like the language model is very heavily weighted. I'm curious if you mean the keyboard? Because that recently switched to on-device I think, which means much smaller models and less compute.
The behavior I'm complaining about happens on both, though as best I can tell the voice typing decides whether to use on-device or cloud-based recognition depending on the conditions when you use it. If you cut your data off you'll get word-by-word recognition, whereas most of the time you're connected the whole sentence will pop in at once, indicating it used the cloud.
Yes! Perplexity is a great idea. Although you could technically have a low perplexity prediction that is not similar to the ground truth transcription.

CER is definitely more granular. There are papers that basically count Deletions, for example, as 0.5(D) when calculating WER - since they consider Deletions "less bad", but if these weights aren't standardized then WER scores will be super hard to compare.
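
A sketch of what such a weighted WER could look like, with the 0.5 deletion weight as a default. This version minimizes the weighted cost directly; papers may instead weight the counts from an ordinary minimum-edit alignment, so treat the details as assumptions.

```python
def weighted_wer(ref, hyp, w_sub=1.0, w_del=0.5, w_ins=1.0):
    """Weighted edit cost between word sequences, divided by reference length."""
    r, h = ref.split(), hyp.split()
    prev = [j * w_ins for j in range(len(h) + 1)]
    for i, ref_word in enumerate(r, 1):
        cur = [i * w_del]
        for j, hyp_word in enumerate(h, 1):
            match_cost = 0.0 if ref_word == hyp_word else w_sub
            cur.append(min(prev[j] + w_del,            # delete a reference word
                           cur[j - 1] + w_ins,         # insert a hypothesis word
                           prev[j - 1] + match_cost))  # match or substitute
        prev = cur
    return prev[-1] / len(r)  # assumes a non-empty reference
```

With "a b c d" as the reference and "a c d" as the hypothesis, the default weights give 0.125, while setting every weight to 1.0 gives the usual 0.25, which shows why scores with different weights can't be compared.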

Personally I think some metric including some type of perplexity is the way to go.

So I was looking at SotA loss functions from a few years ago that weighted the CTC loss by the WER of the decoded phrase.

Could we generalize the WER weighting to optimize for the domain?

Something like

weight = w1 * WER + w2 * phonetic similarity + ...

which also requires a hyperparameter search... But we are already dumping so many GPU hours here.

I assume this is already being investigated by Google, though?

There are some similar techniques where you use evaluation metrics to decide which data to train on each epoch.

I wonder if you could make that parameter trainable instead of using a hyperparameter search for it.

For phonetic similarity I've been playing with a dual objective system that could be promising.