by quickgist 493 days ago
For some reason, most of these (and other narration AIs) sound like someone reading off a teleprompter, rather than natural speaking voices. I'm not sure what exactly it is, but I'm left feeling like the speaker isn't really sure of what the next words are, and the stresses between the words are all over the place. It's like the emphasis over a sentence doesn't really match how humans sound.
6 comments

Yup, and that's going to be the case until AIs can really model human psychology.

Speech encodes a gigantic amount of emotion via prosody and rhythm -- how the speaker is feeling, how they feel about each noun and verb, what they're trying to communicate with it.

If you try to reproduce all the normal speech prosody, it'll be all over the place and SoUnD bIzArRe and won't make any sense, and be incredibly distracting, because there's no coherent psychology behind it.

So "reading off a teleprompter" is really the best we can do for now -- not necessarily affectless, but with a kind of "constant affect" that varies with grammatical structures and other language patterns, but with no real human psychology behind it.

It's a gigantic difference from text, which encodes vastly less information.

(And this is one of the reasons I don't see AI replacing actors for a looong time, not even voice actors. You can map a voice onto someone else's voice preserving their prosody, but you still need a skilled human being producing the prosody in the first place.)

What if you have it read the script, then say, “hey, at this point, what is the character feeling? What are they trying to accomplish? What is their relationship to each person in the scene?”

And then you get that and prompt the model to add inflection and pacing and whatever to the text to reflect that. You feed that into the speech model.

It seems like it could definitely do the first part (“based on this text, this character might be feeling X”); the second part (“mark up the dialogue”) seems easier; the third part about speech seems doable already based on another comment.

So we are pretty close already? Whatever actors are doing can be approximated through prompting, including the director iterating with the “actors”.
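To make the three steps concrete, here is a minimal sketch of the pipeline's shape. Everything in it is an assumption for illustration: `infer_feeling` is a toy stand-in for the language-model step (it only looks at punctuation), the feeling-to-pace table is made up, and the bracket notation is invented rather than anything a real speech model is known to accept.

```python
from dataclasses import dataclass


@dataclass
class ScriptLine:
    speaker: str
    text: str


@dataclass
class Direction:
    feeling: str  # answer to "what is the character feeling?"
    pace: str     # delivery hint derived from the feeling


# Hypothetical mapping from a feeling label to a pacing hint.
PACE_FOR_FEELING = {"agitated": "fast", "uncertain": "slow", "neutral": "normal"}


def infer_feeling(line: ScriptLine) -> str:
    """Stand-in for step 1. A real system would prompt a language model here;
    this toy version only looks at the final punctuation mark."""
    if line.text.endswith("!"):
        return "agitated"
    if line.text.endswith("?"):
        return "uncertain"
    return "neutral"


def direct(line: ScriptLine) -> Direction:
    """Step 2: turn the inferred feeling into delivery directions."""
    feeling = infer_feeling(line)
    return Direction(feeling=feeling, pace=PACE_FOR_FEELING[feeling])


def mark_up(script: list[ScriptLine]) -> list[str]:
    """Input for step 3: one annotated string per line, in an invented
    bracket notation that a speech model would have to be taught to read."""
    marked = []
    for line in script:
        d = direct(line)
        marked.append(
            f"[{line.speaker} | feeling={d.feeling} | pace={d.pace}] {line.text}"
        )
    return marked


script = [
    ScriptLine("ANNA", "Where were you last night?"),
    ScriptLine("BEN", "Out."),
    ScriptLine("ANNA", "Don't lie to me!"),
]
for annotated in mark_up(script):
    print(annotated)
```

Note that each line is annotated in isolation here, so nothing in this sketch keeps the feelings consistent from one line to the next.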

> What if you have it read the script, then say, “hey, at this point, what is the character feeling?...

Sure, but now how do you make sure all the answers to those questions are consistent? Across clauses, sentences, paragraphs? To do that, you need to have an entire understanding of human psychology.

And I haven't seen any evidence that LLMs possess that kind of knowledge at all, except at the most rudimentary level of narrative.

Just think of how even professional directors struggle to communicate to an actor the emotional and psychological feeling they're looking for. We don't even have words or labels for most of the things, and we say "you know how you feel in a situation when <a> and <b> but <c>? You know that thing? No, not that, but when <d>. Yeah, that." Most of these things operate on an intuitive, pre-verbal level of thinking in our brain. I don't think LLMs are anywhere close to being able to capture that stuff yet.

You have to shape the voices in the tools. If you just spit them out they're junk, but if you take the time to shape the voice a bit it gets better quickly. This is a cheap 11labs voice with 30 seconds spent on some basic shaping: https://s.h4x.club/bLuNlJWx

Still a bit teleprompter-ish, but there are tools to go in and adjust pace and style throughout, and a lot of what you hear probably comes from people not using those creative features. 11labs might very well be one of the best bits of software I've used. It's a great deal of fun to play with, and if you're willing to spend the time the results are superb - I don't even have a use case, I just like making them because they're fun to listen to, ha!

What does shaping the voice mean?
You---can, really--slow, speed up or change, how, things sound, by, -- using cues like this, to control how the voice,,, - tells the story {{3sec}} - once you find a voice you like, you can go in and {{1sec}}

//

control how,

it goes about {{1sec}} story telling.
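For anyone who wants to script this, here is a rough sketch that rewrites the `{{Nsec}}` pause markers from the example above into `<break time="Ns"/>` elements, which is how the W3C SSML standard expresses a pause. Treat the `{{Nsec}}` notation as just the shorthand used in this comment; whether a particular TTS tool accepts it, or accepts SSML at all, is an assumption you would need to check against that tool's docs. The function name `cues_to_ssml` is made up.

```python
import re
from xml.sax.saxutils import escape

# Matches markers such as {{3sec}} or {{1.5sec}} and captures the number.
PAUSE = re.compile(r"\{\{(\d+(?:\.\d+)?)sec\}\}")


def cues_to_ssml(text: str) -> str:
    """Convert {{Nsec}} pause markers into SSML <break> elements."""
    # Escape &, < and > first so the surrounding prose stays valid XML.
    safe = escape(text)
    body = PAUSE.sub(lambda m: f'<break time="{m.group(1)}s"/>', safe)
    return f"<speak>{body}</speak>"


print(cues_to_ssml("tells the story {{3sec}} once you find a voice you like"))
```

The commas and dashes in the example are left untouched on purpose: they are read by the voice model as-is, and only the explicit pause markers are rewritten.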

PlayHT's voices are nowhere near as good as ElevenLabs'. These self-reported studies are marketing.

In any case, voice is such a thin vertical that I half expect the Chinese to release an open source TTS model that out-performs everything on the market. Tencent probably has one of these cooking right now.

Oof. I've heard recent AI-generated narrators, and they were OK (much better than a few years ago, much worse than professional humans), but something about the digital postprocessing in this article's YouTube video reminded me of fingernails on a chalkboard.

I couldn't get halfway through.

This is an excellent point: there's some sort of oddity where it's hard to say it's definitively AI, but I can definitively say it's a...low quality human?

I really like "off a teleprompter": it accurately characterizes the subtle dissonances where it sounds like someone who is reading something they haven't read before. 0:14 "infectious (flat) beatsss (drawn out)," which is nearly diametrically opposed to the paired snappy 0:12 "soulful (high / low) vocals (high)."

Most of these models are trained on audiobooks, which could explain the teleprompter feeling vs a natural conversational feeling.