by jchb 2219 days ago
There is Speech Synthesis Markup Language (SSML). Amazon Polly and Google text-to-speech both support it, although the best neural-model-based voices only support a small subset.

Ah thank you, that's very interesting.

So that's not markup along "emotional" lines, but rather along "technical" attributes such as speed, pitch, volume, pause between words, and so on.
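
To make those "technical" attributes concrete, here is a rough sketch of what such markup looks like when built from Python. The element and attribute names (`speak`, `prosody` with `rate`, `pitch`, `volume`, and `break` with `time`) come from the W3C SSML specification; the helper functions `prosody`, `pause`, and `speak` are hypothetical names made up for this illustration, and which attribute values a given voice actually honors varies by vendor.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def prosody(text, rate="medium", pitch="medium", volume="medium"):
    """Wrap text in an SSML <prosody> element (hypothetical helper)."""
    return (
        f"<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)} "
        f"volume={quoteattr(volume)}>{escape(text)}</prosody>"
    )


def pause(ms):
    """Insert a pause between words or sentences, in milliseconds."""
    return f'<break time="{int(ms)}ms"/>'


def speak(*parts):
    """Join the fragments under the SSML root element."""
    return "<speak>" + "".join(parts) + "</speak>"


ssml = speak(
    prosody("Ah, thank you.", rate="slow", pitch="low"),
    pause(400),
    prosody("That's very interesting!", rate="fast", pitch="high", volume="loud"),
)

# Sanity check: the result should at least be well-formed XML.
root = ET.fromstring(ssml)
print(ssml)
```

Nothing in there says "curious" or "grateful"; it only describes how fast, how high, how loud, and how long to wait.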

Obviously, coding those things in XML by hand would be a nightmare. Now I find myself wondering 1) whether these technical parameters can be used to synthesize speech that sounds like a reasonable approximation of emotion (or whether they're insufficient because changes in resonance and timbre are crucial too), and 2) whether there are tools that can translate, say, 100 different basic emotional descriptions ("excitedly curious", "depressed but making effort to show interest", etc.) into the appropriate technical parameters, so that it would be usable.
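
For the second question, the simplest imaginable version of such a tool is a lookup table from emotional descriptions to prosody settings. The sketch below is purely illustrative: `EMOTION_PRESETS`, `Prosody`, and `emotion_to_ssml` are hypothetical names, and the preset values are guesses chosen for the example, not measured or validated settings. It also only touches rate, pitch, and volume, so it says nothing about whether resonance and timbre are needed too.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass(frozen=True)
class Prosody:
    rate: str
    pitch: str
    volume: str


# Hypothetical presets: illustrative guesses, not validated values.
EMOTION_PRESETS = {
    "excitedly curious": Prosody(rate="fast", pitch="high", volume="loud"),
    "depressed but making effort to show interest": Prosody(
        rate="slow", pitch="low", volume="soft"
    ),
}

# Fallback used when a description has no preset.
NEUTRAL = Prosody(rate="medium", pitch="medium", volume="medium")


def emotion_to_ssml(text, emotion):
    """Translate an emotional description into SSML prosody markup."""
    p = EMOTION_PRESETS.get(emotion.strip().lower(), NEUTRAL)
    return (
        f'<speak><prosody rate="{p.rate}" pitch="{p.pitch}" '
        f'volume="{p.volume}">{escape(text)}</prosody></speak>'
    )


print(emotion_to_ssml("Wait, how does that work?", "excitedly curious"))
```

A description with no preset falls back to neutral settings. A real tool would presumably need something much richer than a flat table, such as settings that change over the course of a sentence.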

Anyways, just a fascinating area of study.