by gagan2020 51 days ago
It is not good for text-to-speech (TTS) either. I have been trying it for a few days. First of all, there is no documentation for the 1.5B model. The 0.5B realtime model is shit. I was converting text line by line, and it randomly added music and couldn't handle special characters like "…".

I'm really disappointed with this model, to say the least.

3 comments

> ...it was randomly adding music...

I've been noticing this with the Mistral Voxtral TTS models too. I have my AI record a morning briefing podcast for me, and occasionally there are music-like sounds at the start (the British voice had a musical tone underneath that sounded a little like the end of the BBC News theme). I don't think I've ever encountered that with the OpenAI TTS models, so they're my default again.

Yep, it seems this was trained on a large number of podcasts with ad jingles, or phone call queues with elevator music. I was also pretty disappointed when I ran the TTS last week.
The 7B-parameter VibeVoice TTS model is still the most impressive local TTS model I've tried. Microsoft pulled it a few days after its release due to "abuse potential", but it can be found in various community-maintained Hugging Face repos.