Exactly. Qwen has only one pitch accent for pure-hiragana words. Feeding it hiragana does actually work (it removes the mixed-in Mandarin), but it takes considerable effort to normalize the text and disambiguate heteronyms, and the result (if you use voice cloning) is your favorite CV speaking in some weird, unknown accent :)
That got me wondering whether "you convert to hiragana" is a solved task or one that needs a research team and five years[0]. Google showed me an article[1] that made me facepalm. Quoting via Google Translate (square brackets are mine):
> - As a result,
> - When the string "明日["tomorrow"]" is entered into TTS, the TTS model [・皿・] outputs an ambiguous pronunciation that sounds like a mix of "asu" and "ashita" (something like "[asyeta]").
> From this, we found that by using the proposed method, it is possible to obtain data from private data in which the consistency between speech, graphemes, and phonemes is almost certainly maintained for more than 80% of the total.
> Another possible cause is a mismatch between the domain of the training data's audio (all [in read-aloud tones]) and the inference domain.
My resultant rambling follows:
1. Sounds like the general state of Japanese speech datasets is a mess
1.1. they don't maintain a reliable correspondence between symbols and audio
1.2. they tend to contain too many "transatlantic" voices and too little casual speech
2. Japanese speakers generally don't annotate text with pronunciations
2.1. therefore web crawls might not contain enough information about how the text is actually pronounced
2.2. (potentially) there could be some text that doesn't map to any pronunciation
2.3. (potentially) maybe spoken and written Japanese still diverge a bit from each other
3. The situation for Chinese/Sinitic languages is likely __nowhere__ near as absurd, and so Chinese STT/TTS might not be well equipped to deal with this mess
4. This feels like a much deeper mess than the commonly observed "a cloud in a sky" Japanese TTS problems, such as obvious basic alignment errors (e.g. pronouncing "potatoes" as "tato chi")
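
To make the 明日 example concrete, here is a minimal Python sketch of why "convert to hiragana" is not a plain lookup. Everything in it is my own assumption for illustration: the toy `READINGS` table, the function names, and the pre-segmented token list are hypothetical and are not taken from the article or from any real TTS front end. The readings themselves are ordinary dictionary readings.

```python
from itertools import product

# Hypothetical toy lexicon (illustration only). Each surface form maps to
# every reading it can take; a real system would need context to pick one.
READINGS = {
    "明日": ["あした", "あす", "みょうにち"],
    "今日": ["きょう", "こんにち"],
    "晴れ": ["はれ"],
}


def candidate_readings(token: str) -> list[str]:
    """Return all known readings; unknown tokens pass through unchanged."""
    return READINGS.get(token, [token])


def all_hiragana_candidates(tokens: list[str]) -> list[str]:
    """Enumerate every hiragana string the token sequence could stand for."""
    per_token = [candidate_readings(t) for t in tokens]
    return ["".join(combo) for combo in product(*per_token)]


if __name__ == "__main__":
    # Assumes the sentence is already segmented into tokens.
    for candidate in all_hiragana_candidates(["明日", "は", "晴れ"]):
        print(candidate)
```

Even this three-token sentence yields three equally valid hiragana strings, and nothing in the text says which one the speaker used. If the audio in a dataset says one reading while the transcript was normalized to another, a model trained on it could plausibly end up blending them.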