| HN Mirror


	by staticautomatic 622 days ago
	I’ve been building a production app on top of ASR and find the range of models kind of bewildering compared to LLMs and video. The commercial offerings seem to be custom or built on top of Whisper or maybe NVIDIA Canary/Parakeet, and then you have stuff like SpeechBrain, which seems to run on top of lots of different open models for different tasks. Sometimes it’s genuinely hard to tell what’s a foundation model and what isn’t. Separately, I wonder if this is the model Speechmatics uses.

2 comments

leetharris 622 days ago

We released a new SOTA ASR as open source just a couple of weeks ago. https://www.rev.com/blog/speech-to-text-technology/introduci...

Take a look. We'll be open sourcing more models very soon!

mkl 622 days ago

> These models are accessible under a non-commercial license.

That is not open source.

threeseed 622 days ago

Exactly. It is source-available but not open source:

https://opensource.org/osd

yalok 622 days ago

that's great to hear! amazing performance of the model!

for voice chat bots, however, shorter input utterances are the norm (anywhere from 1-10 sec), with lots of silence in between, so this limitation is a bit sad:

> On the Gigaspeech test suite, Rev’s research model is worse than other open-source models. The average segment length of this corpus is 5.7 seconds; these short segments are not a good match for the design of Rev’s model. These results demonstrate that despite its strong performance on long-form tests, Rev is not the best candidate for short-form recognition applications like voice search.

staticautomatic 622 days ago

I'll check it out.

FWIW, in terms of benchmarking, I'm more interested in benchmarks against Gladia, Deepgram, Pyannote, and Speechmatics than whatever is built into the hyperscaler platforms. But I end up doing my own anyway so whatevs.

Also, you guys need any training data? I have >10K hrs of conversational iso-audio :)

woodson 622 days ago

There’s just no single one-size-fits-all model/pipeline. You choose the right one for the job, depending on whether you need streaming (i.e., low latency; words output right when they’re spoken), whether it runs on device (e.g., a phone) or on a server, which languages/dialects you need, whether the audio is conversational or more “produced” like a news broadcast or podcast, etc. The best way is to benchmark with data in your target domain.
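
For what it's worth, a benchmark like that can be sketched in a few lines. This is only a minimal illustration, assuming you already have reference transcripts and each system's output as plain strings; the function names and the lowercase/whitespace normalization are my own assumptions, not anything a particular vendor or toolkit prescribes.

```python
from typing import Dict, List, Tuple


def word_edits(reference: str, hypothesis: str) -> Tuple[int, int]:
    """Return (word-level edit distance, number of reference words).

    Normalization here is deliberately crude (lowercase + whitespace split);
    this is an assumption, and real benchmarks usually also strip punctuation.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] holds the edit distance between the ref prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1], len(ref)


def corpus_wer(references: List[str], hypotheses: List[str]) -> float:
    """Corpus-level WER: total edits divided by total reference words."""
    if len(references) != len(hypotheses):
        raise ValueError("need one hypothesis per reference")
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        edits, words = word_edits(ref, hyp)
        total_edits += edits
        total_words += words
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_edits / total_words


def rank_systems(references: List[str], outputs: Dict[str, List[str]]) -> List[Tuple[str, float]]:
    """Score every system on the same references, lowest WER first."""
    scores = [(name, corpus_wer(references, hyps)) for name, hyps in outputs.items()]
    return sorted(scores, key=lambda item: item[1])
```

You would call `rank_systems(refs, {"system_a": [...], "system_b": [...]})` with your own in-domain transcripts (the system names are placeholders). Keep in mind that text normalization choices can move WER quite a bit, so apply the same normalization to every system you compare.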

staticautomatic 622 days ago

Sure, you're just going to try lots of things and see what works best. But it's confusing to compare things at such different levels of abstraction: a lot of the time you don't even know what you're comparing, and an apples-to-apples comparison is impossible even on your own test data. If your need is "speaker identification", you end up comparing commercial black boxes like Speechmatics (probably custom) vs commercial translucent boxes like Gladia (some custom blend of Whisper + pyannote + etc.) vs [asr_api]/[some_specific_sepformer_model]. Like, I can observe that products I know to be built on top of Whisper don't seem to handle overlapping-speaker diarization that well, but I don't actually have any way of knowing whether that has anything to do with Whisper.
