Hacker News
by reissbaker 595 days ago
This is really cool. FWIW, existing open-source TTS engines are really bad in comparison to what you have here. I know this is voice-to-voice, but I think there'd be a lot of appetite to get this to also be multimodal and accept text (essentially making it a really good TTS model, in addition to a great voice-to-voice model).

I suppose someone could hack their way around the problem by finetuning it to essentially replay Piper (or whatever) output, only with more natural prosody and intonation. And then have the text LLM pipe to Piper, and Piper pipe to Hertz-dev. But it would be pretty useful to have it accept text natively!
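
For what it's worth, here's a rough sketch of what that chain might look like. Every function here is a hypothetical stand-in: none of these are real Piper or Hertz-dev APIs, and the "audio" is just bytes so the example runs on its own.

```python
from typing import Callable

# Hypothetical stage signatures; not real Piper or Hertz-dev APIs.
TextModel = Callable[[str], str]          # prompt -> reply text
TextToSpeech = Callable[[str], bytes]     # text -> flat-sounding audio
VoiceToVoice = Callable[[bytes], bytes]   # audio -> same words, better prosody


def make_pipeline(llm: TextModel, tts: TextToSpeech,
                  restyle: VoiceToVoice) -> Callable[[str], bytes]:
    """Chain text LLM -> TTS -> voice-to-voice model into one callable."""
    def speak(prompt: str) -> bytes:
        reply_text = llm(prompt)       # text LLM writes the reply
        flat_audio = tts(reply_text)   # Piper-like engine renders it
        return restyle(flat_audio)     # finetuned voice model replays it
    return speak


# Toy stages so the sketch is runnable without any models installed.
def fake_llm(prompt: str) -> str:
    return f"reply to: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


def fake_restyle(audio: bytes) -> bytes:
    return b"natural:" + audio


speak = make_pipeline(fake_llm, fake_tts, fake_restyle)
```

The downside is visible in the shape of it: three models in series, so latency and errors stack up, which is why native text input would be nicer.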

2 comments

They are a team of 4. At that size, it's better for them to be focused on one thing than stretched thin.

Eh, that depends. A small model that's voice-and-text is probably more useful to most people than scaling up a voice-only model: the large voice-only model will have to compete on intelligence with e.g. Qwen and Llama, since it can't be used in conjunction with them; whereas a small voice+text model can be used as a cheap frontend hiding a larger, smarter, but more expensive text-only model behind it. This is an 8b model: running it is nearly free, it can fit on a 4090 with room to spare.
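
Back-of-envelope version of the "fits on a 4090" claim, assuming 16-bit weights (my assumption, the precision isn't stated above) and counting only the weights, not activations or any cache:

```python
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


PARAMS = 8e9            # "8b" as quoted in the comment
RTX_4090_VRAM_GB = 24   # VRAM on an RTX 4090

# Assumption: 2 bytes per parameter (fp16/bf16).
fp16_gb = weight_memory_gb(PARAMS, 2)      # 16.0
headroom_gb = RTX_4090_VRAM_GB - fp16_gb   # 8.0 left for everything else
```

So roughly 16 GB of weights on a 24 GB card, with about 8 GB left over under those assumptions.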

On the one hand, a small team focused on voice-to-voice could probably do a lot better at voice-to-voice than a small team focused on voice-to-voice+text. But a small team focused on making the most useful model would probably do better at that goal by focusing on voice+text rather than voice-only.

Their goal is not working on what's most useful for most people though. That's the domain of the big AI players. They are small and so specialising works best as that's where they can have an edge as a company.

At the end of the day, the released product needs to be good and needs to be done in a reasonable amount of time. I highly doubt they can do a generic model as well as a more specialised one.

But if you think you know better than them, you could try to contact them, even though it looks like they are laser-focused (their public email addresses are either for investors or employee candidates).

Yes, yes. This. Piper is already pretty good . . . and then this.

It may not be _them_ doing it, though.