

by liminalsunset 759 days ago

The technology shown here is great, but I think the problem with most open-source voice assistants is that the voices just don't sound as good as the commercial ones. This one in particular appears to have a mixed accent, and I'd imagine having it speak in French might produce more natural output. Examples of good voices are of course the OpenAI ones, and a bunch of the Nuance voices (I think Tesla and Mercedes Me use the "Ava" voice). Microsoft Azure's voices are decent too.

Even before neural TTS, some commercial TTS solutions were just better than others, and a lot of that came down to the tone and timbre of the speakers. The challenge here, of course, is that:

1) Good voice actors wouldn't want to train AI, especially not for free (though you could go far with people who aren't very professional, just committed)

2) You'll only have a limited set of voices, and people might not want that

3) Any really good model will have usage limitations because it sounds like some particular person

I'm wondering whether there are any techniques to create a good "synthetic" voice. For example, a model could be trained on a large population of speakers and then generate a voice that is character-consistent but not correlated with any individual in the training data, the same way you can generate photos of people who don't exist but still look real. (I'm aware there's style transfer, but how do you create the style to begin with?)
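To make the question a bit more concrete, here is a rough sketch of one way this could work. Everything in it is an assumption for illustration: it supposes each training speaker is already summarized as a fixed-length embedding vector, fits a simple diagonal Gaussian to those vectors, and rejection-samples a new vector that is not too close (by cosine similarity) to any real speaker. The names `fit_diagonal_gaussian`, `sample_novel_voice`, and `max_similarity` are hypothetical, the population is random stand-in data, and the TTS model that would turn the sampled vector into audio is not shown.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def fit_diagonal_gaussian(embeddings):
    """Per-dimension mean and standard deviation of the speaker population."""
    n = len(embeddings)
    dim = len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / n for i in range(dim)]
    std = [
        math.sqrt(sum((e[i] - mean[i]) ** 2 for e in embeddings) / n)
        for i in range(dim)
    ]
    return mean, std


def sample_novel_voice(embeddings, max_similarity=0.8, max_tries=1000, rng=None):
    """Draw a speaker vector from the population distribution, rejecting any
    draw whose closest real speaker exceeds max_similarity."""
    rng = rng or random.Random()
    mean, std = fit_diagonal_gaussian(embeddings)
    for _ in range(max_tries):
        candidate = [rng.gauss(m, s) for m, s in zip(mean, std)]
        closest = max(cosine(candidate, e) for e in embeddings)
        if closest <= max_similarity:
            return candidate, closest
    raise RuntimeError("no sufficiently novel voice found; relax max_similarity")


# Stand-in data: 200 hypothetical speakers with 16-dimensional embeddings.
rng = random.Random(0)
population = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(200)]
voice, closest = sample_novel_voice(population, rng=rng)
```

The sampled `voice` vector would then be stored and reused for every utterance, which is what would keep the character consistent. Whether a voice sampled this way actually sounds good is a separate question that this sketch does not address.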

I think that solving this may be necessary to get a really robust voice AI solution, especially one that's open source. Imagine choosing a voice for your AI personal assistant: wouldn't it be nice to be able to press a button and find *your* Alexa, not just *Alexa*?

You could also use that to find and create voices that "reflect your company's brand" or whatever, and not have to worry about changing the voice every time people come and go.