When Google's announcement [1] was posted a few days ago, I listened to their samples and heard an odd effect in the "chocolate bread" sample (the video chat example) [1], which is not mirrored in this article. On that sample, I felt [2] that the Lyra version exaggerates the pronunciation of the phrase 'with chocolate' in a way that meaningfully differs from the speaker's original. It weakens the voiced 'th' to nothingness, overshoots both the lead consonant and first vowel of 'choc', and then washes the entire rest of the sentence with a peculiar brightened voice that's high, lacks consonant definition, and is close to ringing.
I'm guessing it's actually style transfer: though the result doesn't sound much like the speaker's original, it is reminiscent of the speech pattern and accent that people with East Asian and Southeast Asian ancestry adopt when speaking American English. It was surprising, given that the speaker doesn't sound like that in the original. I wonder if others hear this too.
While Lyra sounds richer and wider-band than Opus or Speex at these bitrates, the degradations and artifacts of those codecs are universally recognized (through years of familiarity with telephones) as compression artifacts and not innate features of the speakers themselves. Listeners can therefore be expected to be sympathetic to the quality issues and not attribute the whole of the sound to the speaker. If AI-trained voice synthesizer codecs become the norm, and they perform well on most speakers, that expectation will go away, and the resulting audio will be attributed wholly to the speaker. That increases the impact of mistakes and misrepresentations introduced by the codec, unbeknownst to the speaker and listener.
[1] https://ai.googleblog.com/2021/02/lyra-new-very-low-bitrate-...
[2] https://news.ycombinator.com/item?id=26282519
I honestly don't hear a 'th' in the original.
> It was surprising, given that the speaker doesn't sound like that in the original.
I disagree. Note that the speaker says "these bread". The three possibilities for those two words ("these bread", "thiiiis bread", and "these breads" with a dropped "s") would all be odd things for a native English speaker to say, for different reasons: either a wrong pronunciation of "this" or "breads", or the fact that bread is its own collective noun, so we typically need a separate count word such as "these buns", "these loaves", or "these pieces" to refer to multiple individual items. We ask for "some bread" or "a piece of bread", but we don't say "a bread" or "some breads" unless we are discussing categorical types of bread ("ciabatta and rye are breads") rather than instances of such, and only one type of bread is represented in the video.
The Lyra reproduction has a band-pass filtered quality to it, but I find it still remarkably representative of the reference.