by kccqzy 2400 days ago
Despite advances in deep learning, it's still very easy to tell a TTS voice from a human voice. Even companies like Apple that have paid special attention[0] to the naturalness of TTS can't get it completely right.

Also, did you notice that in all major animated films (think Disney or Pixar), while the imagery is all computer-generated, the voices are not?

[0]: https://machinelearning.apple.com/2017/08/06/siri-voices.htm...


Here's why I think animated films are done like that.

The best/only way to get most of that computer-generated imagery is by huge amounts of manual labour: designing, animating, simulating, sometimes motion-capturing. It's painstaking detail work involving many people.

The best way to get the voices is with a small amount of manual labour: voice acting.

If you put as much manual effort into controlling the nuances of a TTS engine as goes into the imagery, you might get acceptable results, but it's far easier and cheaper to use voice actors. In fact, the easiest way to tell a TTS engine exactly what you want would probably be to voice act and have it mimic you. This might be worth trying when remapping vocal anatomy (e.g. a woman voicing a man or vice versa, or a monster), but for most purposes it's easier to hire appropriate voice actors and/or manipulate the recorded audio than to use that recording to drive a resynthesis by simulation.

Also, it is a way to get A-list celebs involved and cash in on their popularity.
Maybe someday we will see (or hear) Siri's voice star in one of those Disney flicks.
Weird example since Siri is based on one real woman's voice. Maybe one of the WaveNet "personalities" might be a better example.
Or the Japanese voice idols?
Vocaloids?

Those are deliberately made to sound unnatural. Not to say it changes anything, and they've already shown up once or twice in anime.

(Though the only example I can name off the bat is Black Rock Shooter, and that doesn't include the voice. It's complicated. Mato is complicated, too.)

I use Apple's TTS to read books. It may not sound like a human, but after a few hours of listening to it the strange nature of the voice gets abstracted away by your brain. It's very functional. I knocked out Neal Stephenson's Anathem a week ago in about two days using TTS to read probably 75% of it.

What it doesn't do is the acting. In an audiobook, the voice actor will change their voice in various ways for dramatic effect, and in that respect the book becomes something like a radio drama. With TTS you're getting "just the book". I think that's the major difference and the refuge in which voice actors might hope for continued employment.

I can imagine using TTS to catch up on news articles, magazine articles, reviewing a textbook, maybe even listening to opinion columns while out on the go or multitasking with my Photoshop time, but I can't imagine it doing anything except ruining a novel, or anything else involving drama or comedy.
Far from ruining novels, it's quite pleasant. After you've become accustomed to it, it becomes a very low-fatigue way to read for extended periods of time. I find books read by TTS are just as immersive as reading print. In fact, TTS and reading print seem closer to each other, to me, than print and audiobooks do.

I guess what I'm saying is don't underestimate neuroplasticity. I wager you could even achieve casual fluency with Morse code if you listened to it long enough. I'm under the impression that some telegraph operators did.

That's really interesting. Basically the audiobook is more of an interpretation of the text. It's perhaps closer to a movie adaptation. Once you drop the expectation that the TTS will "tell you a story" and instead expect it to "tell you the words", it becomes just a different input stream. I didn't think about it like that until now. I'll give it a try.
I listen to a lot of non-fiction in TTS but I haven't found it all that satisfactory for novels. Part of the reason, though, is that if I'm listening to a novel via TTS, it's because there isn't an audiobook available and I've had to do some hacky OCR to get the text in the first place.
I do this as well. I used Samsung's TTS engine at first, but Google's has mostly caught up. As a bonus, I can switch between listening (in the car or working out) and reading (most other times) without losing my place.
The Audible audiobook version of Anathem is extremely good in my opinion. Probably one of my favorite readings... I’m surprised that speech synthesis does a reasonable job given the language involved.
The Audible version of The Baroque Cycle is one of the best voice performances I've heard.

Having said that, some fairly small-scale audiobooks narrated by their authors are also very good, as you can hear the author's interest in and passion for their subject:

e.g.

https://www.audible.co.uk/pd/Wilding-Audiobook/B07DDMZ16R

https://www.audible.co.uk/pd/Exactly-Audiobook/B07CQ3RPKC

So far, I think this is a matter of time.

https://www.youtube.com/watch?v=DWK_iYBl8cA impresses me and it just feels like a low quality recording.

It's concerning given how computing power and resources advance over time.

Check out the same kind of work from the Lyrebird team:

[1] https://www.descript.com/lyrebird-ai

It costs hundreds of millions of dollars to animate and market a Pixar feature film. It would be very penny-wise and pound-foolish to try to skimp on voice acting. Animated films will be the last place to start using TTS. Honestly, they might ADR people in low-budget live-action with TTS before it reaches the major animation studios.
> Animated films will be the last place to start using TTS.

Well, big budget theatrical animated films from major studios like Pixar or DreamWorks, sure.

But most animated films aren't those.

The imagery is computer generated because it is the cheaper solution compared to hand-drawn animation.