by mostlyjason 2389 days ago
I really hope you adopt the latest models, particularly streaming attention variants. I also think you should validate with users the assumption that latency is more important than WER.

IMHO WER is more important than latency improvements in the millisecond range. The most frustrating thing is having to dictate the same phrase over and over and getting an incorrect transcription each time.

Consider that the time to a correct transcription is the latency plus error correction. If error correction is manual it will be orders of magnitude slower, so optimize for WER.
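
To make that arithmetic concrete, here is a rough sketch. Every number in it is made up for illustration (not a measurement of any engine), and the function and parameter names are my own:

```python
def time_to_correct_transcript(latency_s, wer, n_words, fix_s_per_word):
    """Expected seconds until the user has a correct transcript.

    Simple model: total time = recognition latency + manual correction time,
    where correction time = expected number of wrong words * time per fix.
    """
    expected_errors = wer * n_words
    return latency_s + expected_errors * fix_s_per_word


# Hypothetical engines dictating a 20-word sentence, assuming 5 s per manual fix.
fast_but_sloppy = time_to_correct_transcript(
    latency_s=0.2, wer=0.15, n_words=20, fix_s_per_word=5.0
)
slow_but_accurate = time_to_correct_transcript(
    latency_s=0.5, wer=0.05, n_words=20, fix_s_per_word=5.0
)

print(f"fast but sloppy:   {fast_but_sloppy:.1f} s")
print(f"slow but accurate: {slow_but_accurate:.1f} s")
```

Under these assumed numbers the extra 300 ms of latency is noise next to the correction time, which is the whole point: the WER term dominates as soon as fixes are manual.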

In terms of competition, Siri has latency in the 5+ second range due to the network call, especially in areas with poor data rates. I think a client-side model like yours will easily win in this category. If you’re already ahead here, why not focus on WER next?

Another great capability is to generate alternative transcriptions for words with low confidence values to allow for quick error correction. Do you offer something like this today?
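
As a sketch of what I mean (this is not DeepSpeech's API; the `WordHypothesis` type, the 0.6 threshold, and the sample data are all hypothetical), the output could carry per-word confidence plus alternatives, and the UI would only surface the shaky words:

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class WordHypothesis:
    """One recognized word with its confidence and runner-up candidates."""
    text: str
    confidence: float
    alternatives: List[str] = field(default_factory=list)


def words_needing_review(
    words: List[WordHypothesis], threshold: float = 0.6
) -> List[Tuple[int, str, List[str]]]:
    """Return (position, word, alternatives) for words below the threshold."""
    return [
        (i, w.text, w.alternatives)
        for i, w in enumerate(words)
        if w.confidence < threshold
    ]


# Made-up recognizer output for a two-word utterance.
transcript = [
    WordHypothesis("recognize", 0.93),
    WordHypothesis("speech", 0.41, ["beach", "peach"]),
]

for position, word, options in words_needing_review(transcript):
    print(f"word {position} ({word!r}) is uncertain, offer: {options}")
```

The user then taps an alternative instead of dictating the whole sentence again.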

Also, consider the long-term view: new models are constantly being released and refined. It’d be best to have an architecture that allows quick model replacement without a lot of hand tuning, or where the tuning can be automated to a greater extent.

1 comment

Thanks for all the hard work you have put in so far, @reubenmorais.

+1000 to @mostlyjason's comment. Great latency figures mean nothing if the word error rate is high, since a high WER dents confidence in the output (so why use DeepSpeech?) and, as the parent comment notes, necessitates manual error correction.

I would love to see a future release focus on optimizing WER for these reasons.