It'd be nice to be able to tag specific words in a sentence for emphasis, with the tagging done by a semantic NLU step and the voice alteration handled by the TTS model.
I think there's already research on "TTS after NLG" that does this, since an NLG system can export meta-information about emphasis in addition to the text (at least for non-end-to-end NLG systems).
Whether that makes a big difference in practice, I don't know.
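
To make the idea concrete, here is a rough sketch of what the hand-off between the tagger and the TTS model could look like, assuming the TTS engine accepts SSML (the W3C markup that has an `<emphasis>` element). The `to_ssml` helper and the hard-coded emphasis indices are made up for illustration; in a real setup the indices would come from the NLU or NLG side, and not every TTS engine honors `<emphasis>`.

```python
from xml.sax.saxutils import escape


def to_ssml(tokens, emphasized, level="strong"):
    """Wrap the tokens whose index is in `emphasized` in SSML <emphasis> tags.

    Hypothetical helper: `emphasized` stands in for whatever the NLU/NLG
    component would export as emphasis meta-info.
    """
    parts = []
    for i, tok in enumerate(tokens):
        text = escape(tok)  # escape &, <, > so the markup stays well-formed
        if i in emphasized:
            text = f'<emphasis level="{level}">{text}</emphasis>'
        parts.append(text)
    return "<speak>" + " ".join(parts) + "</speak>"


tokens = ["I", "never", "said", "that"]
print(to_ssml(tokens, {1}))
# <speak>I <emphasis level="strong">never</emphasis> said that</speak>
```

The resulting string is what would be passed to the TTS model in place of the plain text.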