| HN Mirror


by dandiep 459 days ago

1) Previous TTS models had major problems with accents. E.g. a Spanish sentence could drift from a Spain accent to a Mexican one to an American one, all within one sentence. Has this been improved, or is it still a WIP?

2) What is the latency?

3) Your STT API/Whisper had MAJOR problems with hallucinating things the user didn't say. Is this fixed?

4) Whisper and your audio models often auto-corrected speech, e.g. if someone made a grammatical error. Or if someone was speaking Spanish and inserted an English word, it would change the word to the Spanish equivalent. Does this still happen?

2 comments

jeffharris 459 days ago

1/ we've been working a lot on accents, so expect improvements with these models... though we're not done. Would be curious how you find them. And try giving specific detailed instructions + examples for the accents you want

2/ We're doing everything we can to make it fast. Very critical that it can stream audio meaningfully faster than realtime

3+4/ I wouldn't call hallucinations "solved", but it's been the central focus for these models. So I hope you find it much improved


wewewedxfgdf 459 days ago

As mentioned in another comment, the British accents are very far from being authentic.


jbaudanza 459 days ago

3) Whisper really needs to be paired with Silero VAD, otherwise the hallucination problem makes it almost unusable.
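
To make the pairing concrete, here is a minimal sketch of the gating idea: run a voice activity detector first, then hand only the speech regions to the transcriber, so silence and noise never reach the model. Everything named here is an assumption for illustration. `energy_vad` is a crude energy threshold standing in for a real detector such as Silero VAD, and `transcribe` is any callable you supply (for example, a wrapper around a Whisper call). The frame size, threshold, and `min_frames` values are arbitrary.

```python
from typing import Callable, List, Sequence, Tuple

Frame = Sequence[float]  # one fixed-length chunk of PCM samples in [-1.0, 1.0]


def frame_energy(frame: Frame) -> float:
    """Mean squared amplitude of a frame (0.0 for an empty frame)."""
    return sum(s * s for s in frame) / len(frame) if frame else 0.0


def energy_vad(frame: Frame, threshold: float = 1e-3) -> bool:
    """Placeholder speech detector (NOT Silero VAD): a simple energy gate.

    Swap this for a real VAD in practice; only the bool result matters here.
    """
    return frame_energy(frame) >= threshold


def speech_segments(
    frames: Sequence[Frame],
    is_speech: Callable[[Frame], bool] = energy_vad,
    min_frames: int = 3,
) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges with at least min_frames speech frames in a row."""
    segments: List[Tuple[int, int]] = []
    start = None
    for i, frame in enumerate(frames):
        if is_speech(frame):
            if start is None:
                start = i  # a new speech run begins
        elif start is not None:
            if i - start >= min_frames:  # drop runs too short to be speech
                segments.append((start, i))
            start = None
    if start is not None and len(frames) - start >= min_frames:
        segments.append((start, len(frames)))  # run extends to the end
    return segments


def transcribe_speech_only(
    frames: Sequence[Frame],
    transcribe: Callable[[List[float]], str],
    is_speech: Callable[[Frame], bool] = energy_vad,
) -> str:
    """Call transcribe() once per detected speech segment and join the results."""
    texts = []
    for start, end in speech_segments(frames, is_speech):
        samples = [s for frame in frames[start:end] for s in frame]
        texts.append(transcribe(samples))
    return " ".join(t for t in texts if t)
```

The design choice is that the transcriber is never called on audio the detector rejects, which removes the silent stretches where hallucinated text tends to appear. It does not address hallucinations inside genuine speech segments.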


dandiep 459 days ago

100%, and I’ve done this, but the hallucinations are still there.
