| > It has a very obvious "autotune" To me it has a very obvious "Hindi is my native language" accent. I mean, literally from the first sentence: "The research team at Meta is excited to share our work...". Ouch. The "our work": just ouch. I was wondering why a native English speaker wasn't presenting the video when the video is precisely about generating speech. The first seven seconds are particularly bad. Don't get me wrong: I've got a lovely French accent when I speak English. Either this has been trained on too many audiobooks read by non-native speakers, or they've used their own tech and the "reference audio" given as input came from a non-native speaker. In any case something is seriously off. At 1:59, the "Hi guys, thanks you for tuning in! Today we are going to show you..."... That is obviously a Hindi speaker (it's an example of fixing a real voice by removing background sounds). I think the main voice of the video was done by the same person who did the example at 1:59, and that they generated it with their own "reference audio" feature. And that person ain't a native English speaker. To compare: when the reference audio has a proper English accent (the example with the "diverse ecosystem" at 0:52), the output from the text-to-speech sounds native. I think they just fucked up the demo video, and the tech itself may already be ready for prime time. |