Hacker News
by ebjaas_2022 1492 days ago
You can hear that the human readers place emphasis based on an understanding of the meaning of the text they're reading, and also on an understanding of the humans at the receiving end. It seeps through that they're human. The AI-generated samples are good, but they're bland in comparison. The words the human readers emphasize are typically not emphasized in the AI-generated samples.

I hear more stress by the human reader, but it isn't always in the appropriate spot, IMO.
That might be true. But I think a good reader, such as a news anchor or a voice actor, will know where to put the emphasis and the pauses, in order to help the listener along. It's value-adding. I think most people who do it professionally will have this capacity.
It'd be nice to be able to tag specific words in a sentence for emphasis, where the tagging would be done via semantic NLU tasks and the voice alteration by the TTS model.
That'd be interesting because it'd split the problem into "parse and highlight what should be emphasised" and "do the TTS".
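A rough sketch of that split, assuming the TTS engine accepts SSML and honors its `<emphasis>` element (not every engine does). The `tag_emphasis` function is a hypothetical stand-in for the NLU stage: it just looks words up in a hand-picked set, whereas a real tagger would be a model. `to_ssml` is likewise an illustrative helper, not part of any library.

```python
from xml.sax.saxutils import escape


def tag_emphasis(words, emphasized):
    """Stage 1 (hypothetical stand-in for NLU): mark which words to stress.

    A real system would predict this from meaning and context; here a word
    is stressed if it appears in a hand-picked set.
    """
    return [(w, w.lower().strip(".,!?") in emphasized) for w in words]


def to_ssml(tagged):
    """Stage 2: emit SSML that an SSML-aware TTS engine could consume."""
    parts = []
    for word, stressed in tagged:
        text = escape(word)  # keep &, <, > from breaking the markup
        if stressed:
            parts.append(f'<emphasis level="strong">{text}</emphasis>')
        else:
            parts.append(text)
    return "<speak>" + " ".join(parts) + "</speak>"


sentence = "I never said she stole the money."
ssml = to_ssml(tag_emphasis(sentence.split(), {"she"}))
print(ssml)
```

Changing the set from `{"she"}` to `{"money"}` moves the stress and with it the implied meaning of the sentence, which is exactly the part the tagging stage would have to get right.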
I think there's already research on "TTS after NLG" that does this, since an NLG system can export meta-info about emphasis in addition to the text (at least in the case of non-end-to-end NLG systems).

Whether that makes a big difference in practice, I don't know.