| Large text-to-speech and speech-to-text models have been greatly improving recently. But I wish there were an offline, on-device, multilingual text-to-speech solution with good voices for a standard PC — one that doesn't require a GPU, tons of RAM, or max out the CPU. In my research, I didn't find anything that fits the bill. People often mention Tortoise TTS, but I think it garbles words too often. The only plug-in solution for desktop apps I know of is the commercial and rather pricey Acapela SDK. I hope someone can shrink those new neural network–based models to run efficiently on a typical computer. Ideally, it should run at under 50% CPU load on an average Windows laptop that’s several years old, and start speaking almost immediately (less than 400ms delay). The same goes for speech-to-text. Whisper.cpp is fine, but last time I looked, it wasn't able to transcribe audio at real-time speed on a standard laptop. I'd pay for something like this as long as it's less expensive than Acapela. (My use case is an AAC app.) |
https://huggingface.co/canopylabs/orpheus-3b-0.1-ft
(no affiliation)
it's English only afaics.