| Yup, and that's going to be the case until AIs can really model human psychology. Speech encodes a gigantic amount of emotion via prosody and rhythm -- how the speaker is feeling, how they feel about each noun and verb, what they're trying to communicate with it. If you try to reproduce all the normal speech prosody, it'll be all over the place and SoUnD bIzArRe and won't make any sense, and be incredibly distracting, because there's no coherent psychology behind it. So "reading off a teleprompter" is really the best we can do for now -- not necessarily affectless, but with a kind of "constant affect" that varies with grammatical structures and other language patterns, but with no real human psychology behind it. It's a gigantic difference from text, which encodes vastly less information. (And this is one of the reasons I don't see AI replacing actors for a looong time, not even voice actors. You can map a voice onto someone else's voice while preserving their prosody, but you still need a skilled human being producing the prosody in the first place.) |
And then you take that and prompt the model to add inflection, pacing, and so on to the text to reflect it. Then you feed the result into the speech model.
It seems like it could definitely do the first part (“based on this text, this character might be feeling X”); the second part (“mark up the dialogue”) seems easier; the third part about speech seems doable already based on another comment.
So we are pretty close already? Whatever actors are doing can be approximated through prompting, including the director iterating with the “actors”.
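To make the three steps concrete, here is a rough sketch of what I have in mind. Everything in it is an assumption for illustration: `infer_emotion` is a keyword placeholder standing in for the LLM prompt, the `PROSODY` table is made-up values, and the output is plain SSML (`<speak>` / `<prosody>`) that you would hand to whatever SSML-aware speech model you use. The speech step itself is not shown.

```python
from xml.sax.saxutils import escape

# Hypothetical emotion -> prosody settings; the values are illustrative guesses.
# The attribute names and keyword values (rate, pitch, volume) come from SSML.
PROSODY = {
    "angry": {"rate": "fast", "pitch": "low", "volume": "loud"},
    "sad": {"rate": "slow", "pitch": "low", "volume": "soft"},
    "excited": {"rate": "fast", "pitch": "high", "volume": "loud"},
    "neutral": {"rate": "medium", "pitch": "medium", "volume": "medium"},
}


def infer_emotion(line: str) -> str:
    """Step 1 placeholder: a real pipeline would prompt an LLM here."""
    lowered = line.lower()
    if "!" in line and any(w in lowered for w in ("hate", "never", "get out")):
        return "angry"
    if any(w in lowered for w in ("sorry", "miss", "gone")):
        return "sad"
    if "!" in line:
        return "excited"
    return "neutral"


def mark_up(line: str, emotion: str) -> str:
    """Step 2: wrap one line of dialogue in an SSML prosody element."""
    p = PROSODY.get(emotion, PROSODY["neutral"])
    return (
        f'<prosody rate="{p["rate"]}" pitch="{p["pitch"]}" '
        f'volume="{p["volume"]}">{escape(line)}</prosody>'
    )


def to_ssml(lines: list[str]) -> str:
    """Build the document that step 3 (the speech model) would consume."""
    body = "".join(mark_up(line, infer_emotion(line)) for line in lines)
    return f"<speak>{body}</speak>"


if __name__ == "__main__":
    print(to_ssml(["We won!", "I miss her."]))
```

Note this only sets one affect per line, which is arguably still the "constant affect" the parent describes; the open question is whether a model can pick the per-word markup coherently, and whether a director-style loop of re-prompting can fix it when it does not.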