> Align text to voice (the hardest part) using some private api
That's also the part that would be most interesting to have explained. Is it language-agnostic? After all, the title says "in any language", but I can't think of any text-audio alignment algorithms that don't require a language-specific model. (Unless you just count characters and assume they map linearly to time, which I'd expect to go very badly.)
Having worked for many years in a linguistics research lab where we spent a lot of money paying people to edit and align subtitles and audio transcripts, and having largely written what was at the time the most sophisticated subtitle-and-transcript editing tool available, I can confirm: counting characters and mapping them linearly to timespan, even after isolating vocals, does indeed go very poorly. And much worse when there's singing involved.
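For anyone who wants to see just how crude the character-counting baseline is, here is a minimal sketch. The function name `linear_align` and its parameters are mine, purely for illustration; it assumes you already know where the vocals start and end.

```python
def linear_align(lines, start, end):
    """Give each line a (start, end) span proportional to its character count.

    This is the naive baseline only: it assumes every character takes the
    same amount of time, which real speech (and especially singing) does not.
    """
    total = sum(len(line) for line in lines)
    if total == 0:
        raise ValueError("no text to align")
    spans = []
    consumed = 0
    for line in lines:
        t0 = start + (end - start) * consumed / total
        consumed += len(line)
        t1 = start + (end - start) * consumed / total
        spans.append((t0, t1))
    return spans


if __name__ == "__main__":
    for span in linear_align(["ab", "abcdef"], 0.0, 8.0):
        print(span)
```

Any pause, held note, or instrumental break inside the span shifts every later line, which is why this falls apart so quickly in practice.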
Alternately, since you say speech recognition isn't "even close", I might try going the other way--doing text-to-speech on the known text, attempting to align the synthesized speech with the original audio stream, and then back-porting the timecodes from the audio alignment onto the text.
But that seems a lot more complicated... so, unlikely.
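To make the "align the two speech tracks" step concrete: the usual tool for that kind of audio-to-audio alignment is dynamic time warping (DTW). Below is a rough sketch, not anyone's actual implementation. It assumes each track has already been reduced to one number per frame; a real system would use per-frame feature vectors such as MFCCs and a vector distance. `dtw_path` and `first_match` are names I made up.

```python
def dtw_path(a, b, dist=lambda x, y: abs(x - y)):
    """Return the minimum-cost DTW warping path between sequences a and b.

    The path is a list of (index_in_a, index_in_b) pairs, in order.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest way to align a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Walk back from the end, always stepping to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def first_match(path):
    """Map each frame of the first track to the earliest matching frame of the second."""
    mapping = {}
    for i, j in path:
        mapping.setdefault(i, j)
    return mapping


if __name__ == "__main__":
    synthesized = [0, 1, 2]        # toy "TTS" track, one value per frame
    recorded = [0, 0, 1, 1, 2, 2]  # same content, sung twice as slowly
    print(first_match(dtw_path(synthesized, recorded)))
```

Since the synthesized track's word boundaries are known exactly, looking them up in the mapping gives the corresponding positions in the real recording.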
A way to cheat that would probably work well enough most of the time would be to do spectrographic analysis on the audio stream to identify syllables, and then similarly just count syllables in the known text and line those up. That works better the more consistent your spelling system is, though, and still requires language-specific modelling. If you actually want to do a decent job cross-linguistically, you'd need in the general case a dictionary for every supported language listing syllable counts for each word (because not everybody's orthography is transparent enough to make simple models like counting character sequences work).
If you actually have a fully language-agnostic algorithm for aligning text to audio that's actually decently accurate, though, that's gotta be worth at least a Master's degree in computational linguistics, 'cause on the face of it it doesn't seem to me (who has such a Master's degree) that it should even theoretically be possible.
You are close enough, so I have to keep my word. I’m not a genius, just a Lego builder. I’ve tried a lot of methods, from DL to ML, but the aeneas project (with some optimizations) gave me the best results. Amazing project and even better personality. Take a look at https://github.com/readbeyond/aeneas
Together with espeak-ng, you can get good results for line-level alignment for 108 languages.
The way I'd do it is to use an existing speech recognition system with a large number of language models available (like CMU Sphinx--but probably not CMU Sphinx, 'cause I don't think there are decent openly-available models for 108 different languages for Sphinx; maybe Microsoft's Azure speech to text API or IBM's Watson speech recognition or something like that) to produce a rough transcript with timecodes, and then meet in the middle--use the timecodes from speech recognition, and the known-good text from whatever lyrics you already found, and reduce it to a text-to-text alignment problem so you can match up the ASR timecodes to the known-good text. First pass, I'd probably try an LCS match on the two text streams, but if that wasn't good enough, I'm sure there are better algorithms in the bioinformatics literature.
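Here is roughly what that first-pass LCS match could look like. This is only a sketch under my own assumptions: the ASR output is taken to be a list of `(word, start_time)` pairs, words are compared case-insensitively, and `lcs_pairs` and `transfer_timecodes` are hypothetical names, not part of any speech API.

```python
def lcs_pairs(a, b):
    """Return index pairs (i, j) of one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # table[i][j] = LCS length of a[i:] and b[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    pairs = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def transfer_timecodes(lyrics_words, asr_words):
    """Copy ASR start times onto the known-good words wherever the two texts agree.

    Words the recognizer missed or garbled get None; a real tool would
    interpolate those from their matched neighbours.
    """
    known = [w.lower() for w in lyrics_words]
    heard = [w.lower() for w, _ in asr_words]
    times = [None] * len(lyrics_words)
    for i, j in lcs_pairs(known, heard):
        times[i] = asr_words[j][1]
    return list(zip(lyrics_words, times))


if __name__ == "__main__":
    lyrics = ["the", "quick", "brown", "fox", "jumps"]
    asr = [("the", 0.5), ("quick", 1.1), ("crown", 1.9), ("jumps", 2.4)]
    for word, t in transfer_timecodes(lyrics, asr):
        print(word, t)
```

The matched words act as anchors, so even a fairly bad transcript can pin down line boundaries as long as it gets a word or two right per line.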
Examining the source, it looks like alignment is done via an HTML form data submission to 'https://api.audioai.online/split-align'. Manually visiting that website, however, is not very informative... the entire text of http://audioai.online is
Audio AI API
Split voice from audio
Sync voice to text
contact
Search your query in YouTube using https://github.com/youkaclub/youka-youtube
Search lyrics using https://github.com/youkaclub/youka-lyrics
Split the vocals from instruments using https://github.com/deezer/spleeter
Align text to voice (the hardest part) using some private api