I actually have very little direct experience with automated forced alignment. I have enough experience in the space to know that naive approaches suck, but back when my boss was paying people to do manual alignment, most of the effort went into second-language subtitles for pedagogical studies. That means the text doesn't actually represent the same words that are in the audio, because they're words in a different language, and nothing would do a good job of accurately aligning that! So I got very little support for building a more sophisticated auto-alignment system.
My intuition, however, is that a meet-in-the-middle approach, running automatic speech recognition and then aligning the resulting text streams, would be the optimal one, and indeed every other major forced-alignment tool besides aeneas (https://github.com/pettarin/forced-alignment-tools) does seem to work that way. The catch, of course, is that you actually need decent ASR language models for every target language to make that work, and as you can see from that list, it is rare for any given engine to support more than a few languages; CMU Sphinx probably has the widest support, although it's not the highest-end toolkit for popular languages like English. So, if you really want to maintain the broadest possible language support, and you can afford the API fees, building a new alignment engine that piggy-backs on Microsoft's or IBM's speech recognition APIs is probably the best option. Or, to keep it cheap, I'd use Sphinx's aligner as the preferred option for all the languages that it has models for, and either fall back on aeneas for the remaining languages or (if you can afford occasional API calls to commercial services for the occasional less-popular language) upgrade to the Microsoft/IBM services for those.
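To make the "align the resulting text streams" step concrete, here is a minimal sketch in Python. It assumes the ASR engine has already produced a list of `(word, start, end)` tuples, which is a shape I made up for illustration, not the output format of any particular engine. The function name `align_transcript` is likewise hypothetical, and the standard library's `difflib.SequenceMatcher` stands in for whatever sequence alignment a real tool would use.

```python
from difflib import SequenceMatcher


def normalize(word):
    """Lowercase and strip surrounding punctuation so 'Fox,' matches 'fox'."""
    return word.lower().strip(".,!?;:\"'")


def align_transcript(transcript_words, asr_words):
    """Copy ASR timestamps onto the known transcript.

    transcript_words: list of str, the text you want timed.
    asr_words: list of (word, start_sec, end_sec) tuples (assumed shape).
    Returns one (word, start, end) tuple per transcript word; start and end
    are None where the recognizer's output did not match the transcript.
    """
    ref = [normalize(w) for w in transcript_words]
    hyp = [normalize(w) for w, _, _ in asr_words]
    aligned = [(w, None, None) for w in transcript_words]

    # autojunk=False keeps frequent words from being ignored in long texts.
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            _, start, end = asr_words[block.b + k]
            aligned[block.a + k] = (transcript_words[block.a + k], start, end)
    return aligned


if __name__ == "__main__":
    transcript = ["The", "quick", "brown", "fox", "jumps."]
    # Invented recognizer output with one misrecognized word ("crown").
    asr = [("the", 0.0, 0.2), ("quick", 0.2, 0.5), ("crown", 0.5, 0.9),
           ("fox", 0.9, 1.2), ("jumps", 1.2, 1.6)]
    for entry in align_transcript(transcript, asr):
        print(entry)
```

Words the recognizer got wrong come back without timestamps; a real aligner would interpolate those from the matched neighbours, and the quality of the whole thing depends on how good the ASR model for that language is.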
I’ve tested every single ASR alignment solution mentioned here https://github.com/pettarin/forced-alignment-tools, but they all performed poorly compared to aeneas, even with good language models (English).
My little secret hack for better TTS results is to make the singing sound like speaking.
Currently I’m using the SoX pitch filter. Do you have another idea for how to achieve that?
Sorry, that's where my expertise runs out. I could tell you all about analyzing the linguistic structure of the text, but my experience with audio processing is limited to reading spectrograms and trusting other people's ASR tools.