by TacticalCoder 1100 days ago
> It has a very obvious "autotune"

To me it has a very obvious "Hindi is my native language" accent. I mean after literally the first sentence: "The research team at Meta is excited to share our work...". Ouch. The "our work": just ouch. I was wondering why it wasn't a native English speaker presenting the video when the video is precisely about generating speech.

The first seven seconds are particularly bad.

Don't get me wrong: I've got a lovely French accent when I speak English.

This has either been trained on too many audiobooks spoken by non-natives or they've used their own tech, where the "reference audio" given as input was from a non-native.

In any case something is seriously off.

At 1:59, the "Hi guys, thanks you for tuning in! Today we are going to show you..."... That is obviously a Hindi speaker speaking (it's an example of fixing a real voice by removing background sounds).

I think that the main voice of the video was done by the same person who did the example at 1:59. And I think that they used their example of using a "reference audio".

And that person ain't a native English speaker.

To compare: when the reference audio uses a proper English accent (the example with the "diverse ecosystem" at 0:52), then the output from the text-to-speech sounds native.

I think they just fucked the demo video and it may already be ready for prime time.

4 comments

Maybe they deliberately chose an accent that wasn't native English to demonstrate the style transfer capability. I think the ability of the system to output accented voices is a strength not a weakness, so long as it can do other accents too.
I'm surprised you had such a negative reaction to the Hindi accent! To me, it was no more difficult to understand than my colleagues who speak English as a second language.

To me, this is a style choice for the demo. Not evidence that they "fucked" it up. Accents are common - everyone has one! It's nice to see the model can support your personal voice even if it's not completely neutral English.

> It's nice to see the model can support your personal voice even if it's not completely neutral English

There is no such thing as "neutral" English.

> Nonetheless, a form of speech known to linguists as General American is perceived by many Americans to be "accent-less", meaning a person who speaks in such a manner does not appear to be from anywhere in particular. The region of the United States that most resembles this is the central Midwest, specifically eastern Nebraska (including Omaha and Lincoln), southern and central Iowa (including Des Moines), parts of Missouri, Indiana, Ohio and western Illinois (including Peoria and the Quad Cities, but not the Chicago area).

> Nonetheless, a form of speech known to linguists as General American is perceived by many Americans to be "accent-less"

TLDR: "neutral English" is like "neutral water temperature" - it feels neither hot nor cold because it matches one's body temperature. It's subjective, and terming it "temperatureless water" is even less accurate.

I'd put emphasis on "perceived" and "American" in that statement, and also note that this is limited to regional accents: General American is unambiguously American. Similar to General American, many countries have developed a "Newscaster" accent, e.g. Received Pronunciation for Britain, but it's not considered neutral as it is the "upper class" accent.

In every language I've known well enough to distinguish accents, I've realized newscasters adopt a distinct accent/cadence that's not commonly used. But I wouldn't call it "accentless" - it's just another accent that may or may not have evolved from a culturally dominant regional accent (or a dominant figure from a specific region).

The accent was obvious enough that I wonder if they might not have been trying to hide it at all? Maybe they just happened to pick somebody from the team with a very mild accent.
The accent was part of the show. They demonstrated how to create an accent from scratch: sample a voice in the accent's original language (हिंदी, français) and then have that voice read text in the target language (English). Voilà, accent.