HN Mirror


by mnbbrown 79 days ago

Ran it over our internal dataset of ~250 recordings of people saying British postcodes (all kinds of accents, etc.). It's competitive for sure!

Soniox (stt-async-v4): 176/248 (71.0%)

ElevenLabs (scribe_v2): 170/248 (68.5%)

AssemblyAI (universal-3-pro): 166/248 (66.9%)

Deepgram (nova-3): 158/248 (63.7%)

AssemblyAI (universal-2): 148/248 (59.7%)

Cohere (transcribe-03-2026): 148/248 (59.7%)

Speechmatics (enhanced): 134/248 (54.0%)

P.S. How do I get this to render correctly on here?
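For anyone wanting to reproduce this kind of score, here is a minimal sketch of exact-match scoring on postcodes. The thread does not say how matches were counted, so the normalization rule (uppercase, keep only letters and digits) and the names `normalize_postcode` and `score` are assumptions for illustration, as is the toy data.

```python
import re


def normalize_postcode(text: str) -> str:
    """Uppercase the text and drop everything except letters and digits.

    Assumption: spacing, case, and punctuation should not count as errors.
    """
    return re.sub(r"[^A-Z0-9]", "", text.upper())


def score(references: dict[str, str], hypotheses: dict[str, str]) -> tuple[int, int]:
    """Return (correct, total) using exact match on normalized postcodes."""
    correct = sum(
        normalize_postcode(hypotheses.get(rec_id, "")) == normalize_postcode(ref)
        for rec_id, ref in references.items()
    )
    return correct, len(references)


# Toy data, not from the real dataset.
references = {"rec1": "SW1A 1AA", "rec2": "M1 1AE", "rec3": "EH1 2NG"}
hypotheses = {"rec1": "sw1a1aa", "rec2": "M1 1AE.", "rec3": "EH1 2MG"}

correct, total = score(references, hypotheses)
print(f"{correct}/{total} ({correct / total:.1%})")
```

A real pipeline would probably also need to handle spelled-out output such as "s w one a", which this sketch does not attempt.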

3 comments

jilijeanlouis 79 days ago

Did you try Gladia? It ranks #1 on the STT blind test at https://compare-stt.com/


mnbbrown 79 days ago

Added Gladia. Each line shows how many new cases that model gets right on top of the models above it:

1. Soniox (stt-async-v4): +176 new cases, running total 176/248 (71.0%)

2. ElevenLabs (scribe_v2): +26 new cases, running total 202/248 (81.5%)

3. Speechmatics (enhanced): +12 new cases, running total 214/248 (86.3%)

4. NVIDIA Parakeet (TDT 0.6B v2): +6 new cases, running total 220/248 (88.7%)

5. Mistral (voxtral-mini): +3 new cases, running total 223/248 (89.9%)

6. Gladia: +2 new cases, running total 225/248 (90.7%)

7. AssemblyAI (universal-2): +1 new case, running total 226/248 (91.1%)

8. Deepgram (nova-3): +1 new case, running total 227/248 (91.5%)

9. Cohere (transcribe-03-2026): +0 new cases, running total 227/248 (91.5%)

10. AssemblyAI (universal-3-pro): +0 new cases, running total 227/248 (91.5%)
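The "+N new cases, running total" numbers look like a greedy coverage ordering: repeatedly pick the model that gets the most not-yet-covered recordings right. That is a guess from the shape of the numbers, not something the comment states. A minimal sketch under that assumption, with the hypothetical name `greedy_coverage` and toy data:

```python
def greedy_coverage(correct_by_model: dict[str, set[str]]) -> list[tuple[str, int, int]]:
    """Order models by marginal gain in covered recordings.

    correct_by_model maps a model name to the set of recording ids it
    transcribed correctly. Returns (model, new_cases, running_total) tuples.
    """
    remaining = dict(correct_by_model)
    covered: set[str] = set()
    ranking = []
    while remaining:
        # Pick the model adding the most uncovered recordings.
        # Ties go to whichever model appears first in the dict.
        best = max(remaining, key=lambda name: len(remaining[name] - covered))
        new_cases = remaining.pop(best) - covered
        covered |= new_cases
        ranking.append((best, len(new_cases), len(covered)))
    return ranking


# Toy data, not the real results.
correct_by_model = {
    "model_a": {"r1", "r2", "r3"},
    "model_b": {"r2", "r3", "r4"},
    "model_c": {"r1", "r2"},
}

for rank, (model, new, total) in enumerate(greedy_coverage(correct_by_model), start=1):
    print(f"{rank}. {model}: +{new} new cases, running total {total}")
```

Read this way, the final running total is the number of recordings that at least one model got right, so a model can score well on its own and still show +0 here.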


scotty79 79 days ago

This benchmark should have Whisper large-v3 as one of the models.


Bolwin 79 days ago

Try two newlines between each one


ChrisMarshallNY 79 days ago

That, or add 4 spaces before each line (renders as a <pre>).


mkl 79 days ago

Two spaces: https://news.ycombinator.com/formatdoc

It's for code though, not lists or bullet points.


yorwba 79 days ago

Is the human baseline 248/248?


walthamstow 79 days ago

Assuming all the accents are British, I doubt it. I probably couldn't get all 248 myself.


mnbbrown 79 days ago

They are all transcribed by multiple blinded "accent natives". But yes, your point is valid - going to see if I can tease out the "single person accuracy".
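One possible way to get at a "single person accuracy" is to score each annotator against the majority answer per recording. This is only a sketch of one approach: the data layout, the majority-vote reference, and the name `single_annotator_accuracy` are assumptions, not how the dataset was actually labelled.

```python
from collections import Counter, defaultdict


def single_annotator_accuracy(labels: dict[str, dict[str, str]]) -> dict[str, float]:
    """Score each annotator against the per-recording majority answer.

    labels maps recording id -> {annotator name: transcribed postcode}.
    """
    hits: dict[str, int] = defaultdict(int)
    seen: dict[str, int] = defaultdict(int)
    for per_annotator in labels.values():
        # Most common answer for this recording; ties resolve to the
        # answer that was encountered first.
        consensus, _ = Counter(per_annotator.values()).most_common(1)[0]
        for annotator, answer in per_annotator.items():
            seen[annotator] += 1
            hits[annotator] += answer == consensus
    return {annotator: hits[annotator] / seen[annotator] for annotator in seen}


# Toy data, not from the real dataset.
labels = {
    "rec1": {"ann_a": "SW1A1AA", "ann_b": "SW1A1AA", "ann_c": "SW1A1AA"},
    "rec2": {"ann_a": "M11AE", "ann_b": "M11AE", "ann_c": "N11AE"},
}

print(single_annotator_accuracy(labels))
```

Each annotator's own vote counts toward the consensus here, which flatters the result a little. A leave-one-out consensus would be a stricter variant.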
