| HN Mirror

So let’s play, if you can guess the align method I’ll open source it :)

Alternately, since you say speech recognition isn't "even close", I might try going the other way--doing text-to-speech on the known text, attempting to align the two speech tracks, and then back-porting the timecodes from audio alignment onto the text.

But that seems a lot more complicated... so, unlikely.

A way to cheat that would probably work well enough most of the time would be to do spectrographic analysis on the audio stream to identify syllables, and then similarly just count syllables in the known text and line those up. That works better the more consistent your spelling system is, though, and still requires language-specific modelling. If you actually want to do a decent job cross-linguistically, you'd need in the general case a dictionary for every supported language listing syllable counts for each word (because not everybody's orthography is transparent enough to make simple models like counting character sequences work).
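A minimal sketch of that syllable-counting cheat, assuming English-like spelling where each run of vowel letters is roughly one syllable. The `onsets` list stands in for the output of a hypothetical spectrographic syllable detector, which is not shown here, and both function names are made up for illustration.

```python
import re

def count_syllables(word):
    """Rough estimate: one syllable per run of vowel letters, at least one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def assign_line_starts(lines, onsets):
    """Pair each text line with the onset time of its first syllable.

    lines  -- known-good text, one lyric line per entry
    onsets -- syllable onset times in seconds (assumed detector output)
    """
    starts = []
    idx = 0  # index of the next unused detected syllable
    for line in lines:
        if idx >= len(onsets):
            break  # the detector found fewer syllables than the text has
        starts.append((onsets[idx], line))
        words = re.findall(r"[A-Za-z']+", line)
        idx += sum(count_syllables(w) for w in words)
    return starts

if __name__ == "__main__":
    lyric_lines = ["hello world", "good night"]
    detected = [0.0, 0.4, 0.8, 1.5, 2.0]
    for start, line in assign_line_starts(lyric_lines, detected):
        print(f"{start:5.2f}  {line}")
```

The vowel-run heuristic is exactly the kind of simple model that breaks on opaque orthographies, so a per-language syllable dictionary would replace `count_syllables` in the general case.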

If you actually have a fully language-agnostic algorithm for aligning text to audio that's actually decently accurate, though, that's gotta be worth at least a Master's degree in computational linguistics, 'cause on the face of it it doesn't seem to me (who has such a Master's degree) that it should even theoretically be possible.

You are close enough, so I have to keep my word. I’m not a genius, just a lego builder. I’ve tried a lot of methods, from DL to ML, but the aeneas project (with some optimizations) gave me the best results. Amazing project and even better personality. Take a look at https://github.com/readbeyond/aeneas -- together with espeak-ng, you can get good results for line-level alignment in 108 languages.
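For context, the aeneas README describes its approach as synthesizing the text with a TTS engine and then aligning the synthesized and real audio with dynamic time warping (DTW) over MFCC features. The sketch below is not aeneas code: it is a bare-bones DTW over two 1-D feature sequences, with no MFCC extraction and no band constraint, just to show how a warping path maps frames of one track onto frames of the other. The function name `dtw_path` is made up for illustration.

```python
def dtw_path(a, b):
    """Return the minimum-cost warping path between two 1-D sequences.

    The result is a list of (i, j) pairs meaning "frame i of a is matched
    with frame j of b", running from (0, 0) to (len(a)-1, len(b)-1).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Walk back from the end, always stepping to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path

if __name__ == "__main__":
    synthesized = [1, 2, 3]   # stand-in features for the TTS track
    recorded = [1, 2, 2, 3]   # stand-in features for the real audio
    print(dtw_path(synthesized, recorded))
```

Because the fragment boundaries in the synthesized track are known, looking them up in the path gives the corresponding positions in the recording. This version uses O(n*m) memory, which is one reason long recordings are expensive to align this way.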

yorwba 2299 days ago

Good to know that aeneas works reasonably well even for sung speech. I've tried using aeneas for LibriVox audiobooks (10+ hours), which failed because it tries to load the whole file into memory at once and then compute FFTs on it all at once etc., which I don't have the RAM for. So right now I'm Rewriting in Rust™ using iterators to hopefully reduce memory usage and improve performance.
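The iterator idea is not specific to Rust. As a rough illustration (not the commenter's actual rewrite), here is a Python generator that yields overlapping analysis windows from any sample iterable while holding only one window in memory at a time; each window could then be passed to an FFT. The name `frames` and its parameters are made up for this sketch.

```python
from collections import deque
from itertools import islice

def frames(samples, size, hop):
    """Yield overlapping windows of `size` samples, advancing `hop` each time.

    `samples` can be any iterable (for example a generator reading a file in
    chunks), so the full recording never has to be in memory at once.
    """
    if size < 1 or hop < 1:
        raise ValueError("size and hop must be positive")
    it = iter(samples)
    window = deque(islice(it, size), maxlen=size)
    while len(window) == size:
        yield list(window)
        new = list(islice(it, hop))
        if len(new) < hop:
            break  # not enough samples left for another full window
        window.extend(new)  # deque(maxlen=size) drops the oldest samples

if __name__ == "__main__":
    for frame in frames(range(10), size=4, hop=2):
        print(frame)
```

Memory use here depends on the window size rather than on the length of the audiobook, which is the property the whole-file approach lacks.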

Espeak-ng supporting 108 languages is maybe a bit misleading. They have pronunciation definitions for many languages, but the actual level of support varies widely.

For Mandarin, espeak-ng 1.49.2 has a bug where it reads the tone numbers out loud instead of modifying the pitch contour, so e.g. the number 四 (four) is pronounced si si instead of sì, because it has the fourth tone. That's the version packaged for Ubuntu, so you may be using it for your API.

For Japanese, kanji aren't supported at all, so 四 is pronounced as "Chinese letter" (in English). For proper Japanese support, you'd need to switch to a different TTS engine like Open JTalk or preprocess the text to transform it into kana.

Also note that Aeneas is licensed under AGPL, which requires you to offer the source code if you let others interact with the program over a network (which is what your API does). So your attempt to keep the secret sauce private and only reveal it once someone guessed the algorithm was likely illegal. You should add proper copyright notices to your program and audioai.online

Thanks for your reply, I’ll add a copyright notice soon. I didn’t really try to keep it private, otherwise I could have just ignored the question. The reason I didn’t cite it in the first place is that I’m still testing a few alternatives.

Ah! It's not even trying to do word or syllable-level alignment. Well, that makes the problem considerably more error-tolerant. And they specifically call out ASR-based aligners as more accurate, so that makes me feel good about myself! Still, that's a cool project; thanks for pointing it out. I shall have to dig into it and see what they are actually doing.

Even with only line-level accuracy, that would've been nice to have 7 years ago... but I see the first commit to the project is only in 2015. Might still be useful to some of my old colleagues, though; I'll have to see if they've heard of it.

clashmeifyoucan 2299 days ago

I was playing with Aeneas and didn't really find it THAT accurate. Using Syllabification-by-Analogy and then doing some optimizations, like matching choruses or verses that are repeated, yielded interesting results. When I was doing this, spleeter, demucs, and other vocal isolators weren't out yet, so I should probably have another go with those...

Based on your experience, which alignment method/system is the state of the art? (I’m looking for accurate word/syllable level alignment for Youka)

I actually have very little direct experience with automated forced alignment; I have enough experience in the space to know that naive approaches suck, but back when my boss was paying people to do manual alignment most of the effort went into second-language subtitles for pedagogical studies... which means the text doesn't actually represent the same words that are in the audio, because they're words in a different language, and nothing would do a good job of accurately aligning that! So I got very little support for building in a more sophisticated auto-alignment system.

My intuition, however, is that a meet-in-the-middle approach using automatic speech recognition and then aligning the resulting text streams would be the optimal approach, and indeed every other major forced-alignment tool besides aeneas (https://github.com/pettarin/forced-alignment-tools) does seem to use that approach. The catch, of course, is that you actually need decent ASR language models for every target language to make that work, and as you can see from that list, it is rare for any given engine to support more than a few languages; CMU Sphinx probably has the widest support, although it's not the highest-end toolkit for popular languages like English. So, if you really want to maintain the broadest possible language support, and you can afford the API fees, building a new alignment engine that piggy-backs on Microsoft's or IBM's speech recognition APIs is probably the best option--or, to keep it cheap, I'd go ahead and use Sphinx's aligner as the preferred option for all the languages that it has models for, and either fall back on aeneas for the remaining languages, or (if you can afford occasional API calls to commercial services for the occasional less-popular language) upgrade to Microsoft/IBM services for the remaining languages.

clashmeifyoucan 2299 days ago

How's the performance on some of the harder songs to align? For example, when the voice is too melodic, or has the characteristic high female pitch that can get mistaken for instruments? Something like Royals - Lorde maybe.

I preprocess the vocals using Sox, so the female singing becomes more like male speaking.

The way I'd do it is to use an existing speech recognition system with a large number of language models available (like CMU Sphinx--but probably not CMU Sphinx, 'cause I don't think there are decent openly-available models for 108 different languages for Sphinx; maybe Microsoft's Azure speech to text API or IBM's Watson speech recognition or something like that) to produce a rough transcript with timecodes, and then meet in the middle--use the timecodes from speech recognition, and the known-good text from whatever lyrics you already found, and reduce it to a text-to-text alignment problem so you can match up the ASR timecodes to the known-good text. First pass, I'd probably try an LCS match on the two text streams, but if that wasn't good enough, I'm sure there are better algorithms in the bioinformatics literature.
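A minimal sketch of that first-pass LCS match, assuming the ASR engine returns a list of `(word, start_seconds)` pairs. The ASR call itself is not shown, and the function names and sample timings are made up for illustration. Words in the known-good text that the ASR missed or misheard get `None` and could be interpolated from their neighbours afterwards.

```python
def normalize(word):
    """Lowercase and strip surrounding punctuation so matching is lenient."""
    return word.lower().strip(".,!?\"'")

def lcs_pairs(a, b):
    """Return index pairs (i, j) of one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # table[i][j] = LCS length of a[i:] and b[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    pairs = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs

def transfer_timecodes(lyrics, asr):
    """Copy ASR start times onto the known-good words they match.

    lyrics -- list of known-good words
    asr    -- list of (word, start_seconds) from a speech recognizer (assumed)
    """
    known = [normalize(w) for w in lyrics]
    heard = [normalize(w) for w, _ in asr]
    times = [None] * len(lyrics)
    for i, j in lcs_pairs(known, heard):
        times[i] = asr[j][1]
    return list(zip(lyrics, times))

if __name__ == "__main__":
    lyrics = ["Row", "row", "row", "your", "boat"]
    asr = [("row", 0.5), ("row", 1.1), ("your", 2.2), ("boat", 2.6)]
    print(transfer_timecodes(lyrics, asr))
```

Exact word equality is the weakest part of this sketch; a fuzzier comparison (edit distance, or the sequence-alignment scoring used in bioinformatics) would tolerate near-miss transcriptions.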

ampdepolymerase 2299 days ago

Speech recognition?

not even close

ampdepolymerase 2299 days ago

Method that can be algorithmically reduced to the FFT? In Big O terms at least?