| HN Mirror


by kibbi 455 days ago

Large text-to-speech and speech-to-text models have been greatly improving recently.

But I wish there were an offline, on-device, multilingual text-to-speech solution with good voices for a standard PC — one that doesn't require a GPU, tons of RAM, or max out the CPU.

In my research, I didn't find anything that fits the bill. People often mention Tortoise TTS, but I think it garbles words too often. The only plug-in solution for desktop apps I know of is the commercial and rather pricey Acapela SDK.

I hope someone can shrink those new neural-network-based models to run efficiently on a typical computer. Ideally, it should run at under 50% CPU load on an average Windows laptop that's several years old, and start speaking almost immediately (less than 400 ms delay).
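To make the "start speaking almost immediately" requirement measurable, here is a rough sketch of how time-to-first-audio could be checked against the 400 ms budget. It assumes the TTS engine can be wrapped as a function that yields audio chunks as they are generated; `fake_synthesizer` is a hypothetical stand-in, not any real engine's API.

```python
import time
from typing import Callable, Iterable, Iterator


def time_to_first_chunk(synthesize: Callable[[str], Iterable[bytes]], text: str) -> float:
    """Return the seconds elapsed until the synthesizer yields its first audio chunk."""
    start = time.perf_counter()
    for _chunk in synthesize(text):
        # Stop at the first chunk: this is the delay the listener perceives.
        return time.perf_counter() - start
    raise ValueError("synthesizer produced no audio")


def meets_latency_budget(latency_s: float, budget_ms: float = 400.0) -> bool:
    """True if the measured latency is below the budget (400 ms by default)."""
    return latency_s * 1000.0 < budget_ms


def fake_synthesizer(text: str) -> Iterator[bytes]:
    """Hypothetical stand-in for a real engine: sleeps briefly, then yields silence."""
    for _word in text.split():
        time.sleep(0.01)          # pretend to do some inference work
        yield b"\x00\x00" * 160   # 160 samples of 16-bit silent PCM


if __name__ == "__main__":
    latency = time_to_first_chunk(fake_synthesizer, "I would like some water")
    print(f"first audio after {latency * 1000:.0f} ms, "
          f"within budget: {meets_latency_budget(latency)}")
```

Swapping `fake_synthesizer` for a wrapper around an actual engine would give a number that can be compared across candidates on the same machine.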

The same goes for speech-to-text. Whisper.cpp is fine, but last time I looked, it wasn't able to transcribe audio at real-time speed on a standard laptop.
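"Real-time speed" here can be expressed as a real-time factor: processing time divided by audio duration, where a value at or below 1.0 means the transcriber keeps up with the speaker. A minimal sketch of that arithmetic follows; the function names are illustrative and not part of whisper.cpp.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; lower is faster."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def keeps_up_with_real_time(processing_seconds: float, audio_seconds: float) -> bool:
    """True if the audio was processed at least as fast as it plays back."""
    return real_time_factor(processing_seconds, audio_seconds) <= 1.0


if __name__ == "__main__":
    # Example numbers only: 60 s of audio transcribed in 90 s vs. in 30 s.
    print(real_time_factor(90.0, 60.0), keeps_up_with_real_time(90.0, 60.0))
    print(real_time_factor(30.0, 60.0), keeps_up_with_real_time(30.0, 60.0))
```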

I'd pay for something like this as long as it's less expensive than Acapela.

(My use case is an AAC app.)

5 comments

5kg 455 days ago

May I introduce to you

https://huggingface.co/canopylabs/orpheus-3b-0.1-ft

(no affiliation)

It's English-only, as far as I can see.


kibbi 455 days ago

The sample sounds impressive, but based on their claim -- 'Streaming inference is faster than playback even on an A100 40GB for the 3 billion parameter model' -- I don't think this could run on a standard laptop.


wingworks 455 days ago

Did you try Kokoro? You can self-host it: https://huggingface.co/spaces/hexgrad/Kokoro-TTS


kibbi 455 days ago

Thanks! But I get the impression that with Kokoro, a strong CPU still requires about two seconds to generate one sentence, which is too much of a delay for a TTS voice in an AAC app.

I'd rather accept a little compromise regarding the voice and intonation quality, as long as the TTS system doesn't frequently garble words. The AAC app is used on tablet PCs running from battery, so the lower the CPU usage and energy draw, the better.


SamPatt 454 days ago

Definitely give it a try yourself. It's very small and shouldn't be hard to test.


ZeroTalent 455 days ago

Look into https://superwhisper.com and their local models. Pretty decent.


kibbi 455 days ago

Thank you, but they say "Offline models only run really well on Apple Silicon macs."


ZeroTalent 455 days ago

Many SOTA apps are, unfortunately, only for Apple M Macs.


dharmab 455 days ago

I use Piper for one of my apps. It runs on the CPU and doesn't require a GPU. It will run well on a Raspberry Pi. I found a couple of permissively licensed voices that could handle technical terms without garbling them.

However, it is unmaintained and the Apple Silicon build is broken.

My app also uses whisper.cpp. It runs in real time on Apple Silicon or on modern fast CPUs like AMD's gaming CPUs.


kibbi 455 days ago

I had already suspected that I hadn't found all the possibilities regarding Tortoise TTS, Coqui, Piper, etc. It is sometimes difficult to determine how good a TTS framework really is.

Do you possibly have links to the voices you found?


dharmab 455 days ago

Here's my code! https://github.com/dharmab/skyeye/tree/main/pkg/synthesizer


Ey7NFZ3P0nzAe 454 days ago

I've heard good things about Fish Audio.
