by kccqzy
2400 days ago
Despite advances in deep learning, it's still very easy to tell a TTS voice from a human voice. Even companies like Apple that have paid special attention[0] to the naturalness of TTS can't get it completely right. Also, did you notice that in all major animated films (think Disney or Pixar), while the imagery is all computer-generated, the voices are not?
[0]: https://machinelearning.apple.com/2017/08/06/siri-voices.htm...
|
The best/only way to get most of that computer-generated imagery is by huge amounts of manual labour: designing, animating, simulating, sometimes motion-capturing. It's painstaking detail work involving many people.
The best way to get the voices is with a small amount of manual labour: voice acting.
If you put as much manual effort into controlling the nuances of a TTS engine as goes into the imagery, you might get acceptable results, but it's far easier and cheaper to use voice actors. In fact, the easiest way to tell a TTS engine exactly what you want would probably be to voice act and have it mimic you. This might be worth trying when remapping vocal anatomy (e.g. a woman voicing a man or vice versa, or a monster), but for most purposes it's easier to hire appropriate voice actors and/or manipulate the recorded audio than to use the recording to drive a resynthesis by simulation.