HN Mirror

by bartman 85 days ago

Even in the commercial space, there's a lack of production-grade ASR APIs that support both diarization and word-level timestamps.

My experiences with Google's Chirp have been horrendous: it sometimes skips sections of speech entirely, hallucinates speech where the audio contains only noise, and returns unreliable word-level timestamps. And all of this is even with their new audio prefiltering feature enabled.

AWS works slightly better, but also has trouble keeping word-level timestamps in sync.

Whisper is nice but hallucinates regularly.

OpenAI's new transcription models deliver accurate output but do not support word-level timestamps…

A lot of this could be worked around by sending the resulting transcripts through a few layers of post-processing, but… I just want to pay for an API that is reliable and saves me from doing all that work.

2 comments

catlifeonmars 85 days ago

I wonder if you could run multiple models and average out the timestamps, kind of like how atomic clocks are used together rather than separately.
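
A rough sketch of that idea, assuming every model returns the same word sequence so that words line up one-to-one (a strong assumption; real outputs would need an alignment step first). It uses the median rather than a plain average so that a single badly drifting model does not skew the result. The name `merge_word_timestamps` and the `(word, start, end)` tuple layout are made up for illustration.

```python
from statistics import median


def merge_word_timestamps(transcripts):
    """Combine per-word timestamps from several models.

    `transcripts` is a list with one entry per model; each entry is a
    list of (text, start, end) tuples in seconds. All entries are
    assumed to contain the same words in the same order.
    """
    if not transcripts:
        return []
    length = len(transcripts[0])
    if any(len(t) != length for t in transcripts):
        raise ValueError("all transcripts must have the same number of words")

    merged = []
    for aligned in zip(*transcripts):
        text = aligned[0][0]  # take the word text from the first model
        start = median(w[1] for w in aligned)
        end = median(w[2] for w in aligned)
        merged.append((text, start, end))
    return merged
```

With three or more models, an outlier timestamp from one of them is simply outvoted by the others.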

stavros 85 days ago

Isn't ElevenLabs the best at this?

gardnr 85 days ago

They can have issues with the timestamps: https://github.com/elevenlabs/elevenlabs-python/issues/707

bartman 85 days ago

I've not tested their speech-to-text yet, but based on the docs it looks promising. Thanks for the suggestion!

stavros 85 days ago

It's fantastic, and their diarization is spot on as well.
