by dandiep 459 days ago
1) Previous TTS models had major problems with accents. E.g. a Spanish sentence could drift from a Spain accent to Mexican to American, all within one sentence. Has this been improved and/or is it still a WIP?

2) What is the latency?

3) Your STT API/Whisper had MAJOR problems with hallucinating things the user didn't say. Is this fixed?

4) Whisper and your audio models often auto-corrected speech, e.g. if someone made a grammatical error. Or if someone speaking Spanish inserted an English word, it would change the word to the Spanish equivalent. Does this still happen?


1/ We've been working a lot on accents, so expect improvements with these models, though we're not done. Would be curious how you find them. And try giving specific, detailed instructions + examples for the accents you want.

2/ We're doing everything we can to make it fast. It's critical that it can stream audio meaningfully faster than realtime.

3+4/ I wouldn't call hallucinations "solved", but it's been the central focus for these models. So I hope you find it much improved

As mentioned in another comment, the British accents are very far from being authentic.
3) Whisper really needs to be paired with Silero VAD, otherwise the hallucination problem makes it almost unusable.
100% and I’ve done this, but it’s still there.
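
For anyone wondering what "pairing Whisper with a VAD" looks like in practice, here is a rough sketch of the gating step only. It assumes the VAD hands back speech segments as dicts with `start` and `end` sample indices (the shape Silero VAD's timestamp helper is commonly described as returning), and `transcribe` is a hypothetical stand-in for whatever STT call you use. The helper names are made up for illustration. As the reply above says, this tends to reduce hallucinations on silence rather than remove them.

```python
from typing import Callable, Dict, List, Sequence


def keep_speech_only(
    samples: Sequence[float],
    segments: List[Dict[str, int]],
    pad: int = 0,
) -> List[float]:
    """Concatenate only the sample ranges the VAD marked as speech.

    `segments` is assumed to be sorted by start index. `pad` keeps a few
    extra samples around each segment so word edges are not clipped.
    """
    kept: List[float] = []
    cursor = 0  # end of the last range copied, so padded ranges never overlap
    for seg in segments:
        start = max(cursor, seg["start"] - pad)
        end = min(len(samples), seg["end"] + pad)
        if end > start:
            kept.extend(samples[start:end])
            cursor = end
    return kept


def transcribe_with_vad(
    samples: Sequence[float],
    segments: List[Dict[str, int]],
    transcribe: Callable[[List[float]], str],
) -> str:
    """Run the (hypothetical) transcriber only if the VAD found speech."""
    speech = keep_speech_only(samples, segments)
    if not speech:
        # Nothing but silence or noise: skip the model entirely instead of
        # letting it invent text for an empty input.
        return ""
    return transcribe(speech)
```

The early return on empty input is the important part: silence never reaches the model, so it has nothing to hallucinate over.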