| HN Mirror



	by dylanbfox 1745 days ago
	Interesting. It seems like in the "real world" WER is not really the metric that matters, it's more about "is this ASR system performing well enough to solve my use case" - which is better measured through task-specific metrics like the one you outlined in your paper.


6gvONxR4sf7o 1745 days ago

A pure ASR analog of this is how long a continuous utterance it enables. When I use tools like the ones lunixbochs builds (including his own) the challenge as a user is trading off doing little bits at a time (slow, but easier to go back and correct) vs saying a whole ‘sentence’ in one go (fast and natural but you’re probably going to have to go back and edit/try again).

Sentence/command error rate (the share of sentences/commands that aren’t 100% correct and so need editing or re-attempting) is a decent proxy for this. It’s no silver bullet, but it more directly measures how frustrated your users will be.
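To make the difference concrete, here is a minimal sketch comparing WER with sentence error rate on a handful of made-up reference/hypothesis pairs. The function names and the sample sentences are illustrative assumptions, not taken from any particular ASR toolkit.

```python
def word_edit_distance(ref, hyp):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free if words match)
            ))
        prev = curr
    return prev[-1]


def wer(pairs):
    """Word error rate: total word edits divided by total reference words."""
    errors = sum(word_edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return errors / words


def sentence_error_rate(pairs):
    """Fraction of utterances that are not an exact word-for-word match."""
    wrong = sum(r.split() != h.split() for r, h in pairs)
    return wrong / len(pairs)


# Made-up (reference, hypothesis) pairs for illustration only.
pairs = [
    ("i like the look of it", "i liked the look of it"),
    ("it weighs around five kilos", "it weighs pound five kilos"),
    ("open the next file", "open the next file"),
    ("go to line ten", "go to line ten"),
]

print(f"WER: {wer(pairs):.3f}")
print(f"Sentence error rate: {sentence_error_rate(pairs):.3f}")
```

In this toy set only 2 of 19 words are wrong, but half of the utterances need a second pass, which is closer to what the user actually feels.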

If you really wanted to take care of the issues in the article, you could interview a bunch of users and find what percent of them would go back and edit each kind of mistake (if 70% would have to go back and change ‘liked’ to ‘like’ then it’s 70% as bad as substituting ‘pound’ for ‘around’ which presumably every user will go back and edit).
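A rough sketch of that weighting idea, assuming you already have a list of observed substitutions and a survey-derived table of how often users would bother fixing each one. The table, the `weighted_error_cost` helper, and the default weight are all hypothetical; only the 70% and "every user" figures come from the example above.

```python
# Hypothetical survey result: fraction of users who would go back and fix
# each (intended word, recognized word) substitution.
edit_fraction = {
    ("like", "liked"): 0.7,    # 70% of users would correct this
    ("around", "pound"): 1.0,  # presumably everyone would correct this
}


def weighted_error_cost(substitutions, edit_fraction, default=1.0):
    """Sum of per-error weights; unknown error kinds count as a full error."""
    return sum(edit_fraction.get(sub, default) for sub in substitutions)


# Substitutions observed in some test run (made up for illustration).
observed = [("like", "liked"), ("around", "pound")]

print(weighted_error_cost(observed, edit_fraction))
```

Plain WER would count these as two equal errors; the weighted cost treats the first as 70% as bad as the second.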

The infuriating thing as a user is when metrics don’t map to the extra work I have to do.


lunixbochs 1745 days ago

> vs saying a whole ‘sentence’ in one go (fast and natural but you’re probably going to have to go back and edit/try again)

"probably going to have to go back and edit" is generally not the case with my Conformer model, which allows fast paced usage like this with practice: https://twitter.com/lunixbochs/status/1378159234861264896


6gvONxR4sf7o 1745 days ago

Unfortunately that was the model I had in mind when I wrote that. I used it for maybe a month (I'm pretty sure), and my experience just wasn't as good as yours. It may be better than what preceded it, but it still drove me crazy. I came away with the conclusion that ASR as a technology just isn't there yet.

(and the conclusion that I need to prevent the return of RSI at all costs from now on. Don't get me wrong, I'm very thankful that talon does as well as it does. It was a job saver.)


lunixbochs 1745 days ago

Are you referring to the test you mentioned in this thread? https://news.ycombinator.com/item?id=26784732

If so, December predates Conformer, so you're talking about the sconv model, which is the model I was complaining about upthread - it was very polarizing with users, and despite the theoretical WER improvements, the errors were much more catastrophic than the model that preceded it.

In either case, I'm constantly making improvements - I'm in the middle of a retrain that fixes some of the biggest issues (such as misrecognizing some short commands as numbers), and I've done a lot of other work recently that has really polished up the experience with the existing model.


6gvONxR4sf7o 1745 days ago

I totally forgot about that conversation! Yeah I must be referring to sconv then. I was thinking of the new custom-trained model you were releasing to your paid beta patreon subscribers, and confused the two.

As a side rant, it turned out that simply stepping away from work for a few weeks around the holidays nearly fixed my RSI, which makes me so sad about the nature of my career whenever it crops back up.

Btw, any chance you've done any work on the `phones` or related tooling? I remember that (and editing in general) being a pain point.


lunixbochs 1745 days ago

Yeah for sure, breaks are really important.

sconv was especially disappointing because it looked so good on metrics during my training, but the cracks really started to show once it entered user testing. Conformer has been so much less stressful in comparison because most user complaints are about near misses (or completely ambiguous speech where the output is not _wrong_ per se if you listen to the audio) rather than catastrophic failure.

There's another interesting emergent behavior with my user base as I make improvements, which is that as I release improved models allowing users to speak faster without mistakes, some users will speak even faster until there are mistakes again.

Edit: Yep! There have been several improvements on editing, though that's more in the user script domain and my work has still been mostly on the backing tech. I'm planning on working on "first party" user scripts in the future where that stuff is more polished too.
