by rjwilmsi 968 days ago
I agree. When using the small or medium en models, either for real-time speech recognition of a native English speaker or for transcribing podcasts of native English speakers, the error rate is nowhere near 10%. I'd estimate something like 1%, and most of those errors are arguably subjective decisions about punctuation. I have found, though, that error rates are much higher on the tiny model and higher on the base model.

I assume, therefore, that the 10% word error rate is on very difficult audio, such as pilots speaking to air traffic control (distorted or clipped microphones with significant background noise). I personally find that kind of audio difficult to understand fully, even though I'm a native English speaker and even when both the pilots and the controllers are native English speakers.
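
For anyone wanting to check numbers like these on their own audio, here is a rough sketch of how word error rate is usually computed: word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference transcript. The function name and the `strip_punct` flag are my own for illustration, not from any particular toolkit, and the normalization (lowercasing, optionally dropping ASCII punctuation) is an assumption; real benchmarks may normalize differently.

```python
import string


def word_error_rate(reference: str, hypothesis: str, strip_punct: bool = False) -> float:
    """WER = word-level edit distance / number of reference words.

    Hypothetical helper for illustration; normalization rules are assumed.
    """
    def tokens(text: str) -> list[str]:
        if strip_punct:
            # Drop ASCII punctuation so "world." and "world" compare equal.
            text = text.translate(str.maketrans("", "", string.punctuation))
        return text.lower().split()

    ref, hyp = tokens(reference), tokens(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Standard Levenshtein distance over words, keeping one row at a time.
    # prev[j] is the distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion (reference word missing)
                curr[j - 1] + 1,         # insertion (extra hypothesis word)
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


ref = "Hello, world. How are you?"
hyp = "hello world how are you"
print(word_error_rate(ref, hyp))                    # punctuation counted as errors
print(word_error_rate(ref, hyp, strip_punct=True))  # punctuation ignored
```

This also shows why the punctuation point matters: whether punctuation differences count as errors can move the measured rate a lot on otherwise identical transcripts.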