by tkgally 495 days ago
I tried it with a paragraph of English taken from a formal speech, and it sounded quite good. I would not have been able to distinguish it from a skilled human narrator.

But then I tried a paragraph of Japanese text, also from a formal speech, with the language set to Japanese and the narrator set to Yumiko Narrative. The result was a weird mixture of Korean, Chinese, and Japanese readings for the kanji and kana, all with a Korean accent, and numbers read in English with an American accent. I regenerated the output twice, and the results were similar. Completely unusable.

I tried the same paragraph on ElevenLabs. The output was all in Japanese and had natural intonation, but there were two or three misreadings per sentence that would render it unusable for any practical purpose. Examples: 私の生の声 was read as watashi no koe no koe when it should have been watashi no nama no koe. 公開形式 was read as kōkai keiji instead of kōkai keishiki. Neither kanji misreading would be correct in any context. Even weirder, the year 2020 was read as 2021. Such misreadings would confuse and mislead any listeners.

I know that Japanese text-to-speech is especially challenging because kanji can often be read many different ways depending on the context, the specific referent, and other factors. But based on these tests, neither PlayAI nor ElevenLabs should be offering Japanese TTS services commercially yet.

3 comments

So kind of unrelated, but the reading/singing of arbitrary custom lyrics on suno.com's v4 model has blown me away.
Suno is uncomfortably good. I run a group for helping founders, and sometimes I make little Suno songs to accompany the classes for fun. I'm always impressed by what it spits out. (prompt: song for founder who have happy ears bringing them tears > 30 seconds gen >) https://s.h4x.club/p9u4ezl2 / https://s.h4x.club/mXuND7Eb / https://s.h4x.club/L1u2DYzW
Suno songs always have way too much treble or reverb, or something I can't quite put my finger on. They're very bright sounding.

I don't think it's a fatal flaw, but I hope future versions improve on this, or Suno starts doing some more post-processing to address it. I know there's a new "remaster" feature, but I'm not sure if it does anything there either.

Yeah, they're way too wide and not muddy enough. If you're going to be as wide as they often are, you need to fill it; otherwise they just sound overproduced and weirdly bright. I was thinking earlier about why they sound really good but not actually good, and then I realized that's how I feel about most modern pop music anyway. I think the main thing you're hearing, or at least the thing I find annoying, is how the AI does harmony: if you listen closely, it seems to almost clone the original vocal line and pitch it up and out so it feels slightly offset, giving the appearance of a second vocalist. It's a trick I do in Ableton to see what things might sound like built differently with vocals, and it feels very much like the Suno fake backing singers. I do think they're 6 months or so away from nailing a lot of this given how quickly they've been moving; I follow them closely and it's been impressive. (You can also pull meatier stuff out of it if you work it a bit: https://s.h4x.club/04uz6klg - I don't think this sounds particularly "AI" at all. Edit: turns out if you play up chamber orchestra and choir in the prompting you can get some much better stuff out of it: https://s.h4x.club/eDubr9xJ)
AI generated music is like AI Art.

It feels really generic. But to be fair a lot of art is just like that.

How many animated shows use the Family Guy laziest-common-denominator style? Storylines are written to be easy to follow and mundane.

Ask ChatGPT to write a complex story about divorce and trauma. It'll either refuse to do it or come up with a Hallmark ending.

I mean, yes and no. If you just let it generate lazily, then yes. However, if you work on the lyrics and generate a bunch of samples, no: it can be very powerful and artistic.
Alternatively, text that is input to these services should first be passed through a normalization process, e.g. using Llama to convert kanji to hiragana or a romanization. The TTS output is then much better.
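For illustration only, here is a minimal Python sketch of that kind of pre-TTS normalization pass. It swaps in a hand-written lookup table instead of an LLM, so it only covers spans you already know get misread. The two entries are the phrases reported as misread at the top of the thread; the names `READING_OVERRIDES` and `normalize_for_tts` are hypothetical, not part of any TTS service's API.

```python
# Hypothetical override table: kanji surface form -> hiragana reading.
# The entries are the two phrases reported as misread earlier in the thread.
READING_OVERRIDES = {
    "私の生の声": "わたしのなまのこえ",
    "公開形式": "こうかいけいしき",
}


def normalize_for_tts(text, overrides=None):
    """Replace known-problematic spans with hiragana before sending text to TTS."""
    if overrides is None:
        overrides = READING_OVERRIDES
    # Apply longer surface forms first so a short key cannot
    # rewrite part of a longer phrase before the longer key is tried.
    for surface in sorted(overrides, key=len, reverse=True):
        text = text.replace(surface, overrides[surface])
    return text
```

Anything not in the table passes through unchanged, so this does nothing for kanji you have not listed. An LLM or a morphological analyzer would be needed for general coverage.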
Unfortunately, a simple normalization of kanji --> hiragana throws away pronunciation information.
You could just as easily use the LLM to convert the kanji into phonemes.
You can't lose word boundaries, and phonemes don't tell you which part of the word is emphasized.
Modern TTS engines use tokenizers to convert words to phonemes. See: https://github.com/FunAudioLLM/CosyVoice/issues/202
Yeah, there are no good options for Japanese yet (except maybe in Japan, but I haven't heard of good AI models for speech locally).
Anecdotally, GPT-SoVITS is quite good at Japanese, though I can't evaluate it firsthand as my Japanese is trash.