by fomojola 2182 days ago
As an aside: I've been reading the AWS blog posts from Jeff Barr, but ignoring the Amazon Polly audio conversions. I actually listened to it today, and not only is it not terrible, but there's a moment (around 1:05 in or so) where you can actually hear an inhalation! I know @jeffbarr is sometimes in these threads: is that a standard feature of AWS Polly, or is there some preprocessing that is generating SSML to control cadence, and if so how do we get our hands on THAT?
3 comments

Thanks for listening to Polly! That's Polly's Matthew voice. There's no special preprocessing, and you can make your own text sound like that.
Breathing is a feature you have been able to turn on in Amazon Polly since 2018. There are automated, manual, and mixed modes, depending on how much you want to control the breaths by hand. More info here: https://aws.amazon.com/about-aws/whats-new/2018/03/amazon-po...
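For anyone who wants to try the automated mode, here is a rough sketch of what that could look like. It wraps text in Polly's `<amazon:auto-breaths>` SSML tag and sends it through boto3's `synthesize_speech`. The helper names (`with_auto_breaths`, `synthesize`) are made up for illustration, the attribute values are as I recall them from the Polly SSML docs, and as far as I know the breath tags only apply to standard (non-neural) voices, so check the current documentation before relying on any of this.

```python
from xml.sax.saxutils import escape

# Allowed attribute values as recalled from the Polly SSML docs (an assumption;
# verify against the current documentation).
VOLUMES = {"default", "x-soft", "soft", "medium", "loud", "x-loud"}
FREQUENCIES = {"default", "x-low", "low", "medium", "high", "x-high"}
DURATIONS = {"default", "x-short", "short", "medium", "long", "x-long"}


def with_auto_breaths(text, volume="medium", frequency="medium", duration="medium"):
    """Wrap plain text in SSML that asks Polly to insert breaths automatically.

    Hypothetical helper, not part of any AWS SDK.
    """
    if volume not in VOLUMES:
        raise ValueError(f"unsupported volume: {volume!r}")
    if frequency not in FREQUENCIES:
        raise ValueError(f"unsupported frequency: {frequency!r}")
    if duration not in DURATIONS:
        raise ValueError(f"unsupported duration: {duration!r}")
    # Escape &, <, > so the text cannot break the SSML markup.
    body = escape(text)
    return (
        f'<speak><amazon:auto-breaths volume="{volume}" '
        f'frequency="{frequency}" duration="{duration}">'
        f"{body}</amazon:auto-breaths></speak>"
    )


def synthesize(ssml, voice_id="Matthew"):
    """Send SSML to Polly and return MP3 bytes.

    Needs boto3 and AWS credentials; imported lazily so the SSML helper
    above works without either.
    """
    import boto3

    polly = boto3.client("polly")
    response = polly.synthesize_speech(
        Text=ssml,
        TextType="ssml",
        VoiceId=voice_id,
        OutputFormat="mp3",
    )
    return response["AudioStream"].read()


if __name__ == "__main__":
    print(with_auto_breaths("Breathing makes long passages sound less robotic."))
```

For the manual mode you would instead drop a self-closing `<amazon:breath/>` tag at the exact spot where you want a breath, and the mixed mode combines both inside one `<speak>` document.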
It still sounds very robotic to me. I think Google's WaveNet sounds much more natural: https://cloud.google.com/text-to-speech#section-2
There's a personal taste element: I agree with you that certain WaveNet voices sound better (I've actually used them for video narration with some success). The breathing caught me off guard: it took me a minute to identify THAT as the element that was there but I implicitly wasn't expecting to hear.

The breathing + pausing at commas/full stops and general cadence was frankly superior to what I've heard from Google Cloud Voice, which is why I was curious whether preprocessing was done. I've generally had to do multiple manual passes with Google Cloud Voice to get audio output that didn't sound robotic.