| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by ronyfadel 2300 days ago
	I wish the readme had a description of how Youka works. Looks promising, but I’m not sure it does what I think it does.

1 comments

youka 2300 days ago

I'll add some explanation soon. Here's the main process:

Search your query in YouTube using https://github.com/youkaclub/youka-youtube

Search lyrics using https://github.com/youkaclub/youka-lyrics

Split the vocals from instruments using https://github.com/deezer/spleeter

Align text to voice (the hardest part) using some private api

link

yorwba 2299 days ago

> Align text to voice (the hardest part) using some private api

That's also the part that would be most interesting to have explained. Is it language-agnostic? After all, the title says "in any language", but I can't think of any text-audio alignment algorithms that don't require a language-specific model. (Unless you just count characters and assume they map linearly to time, which I'd expect to go very badly.)

link

gliese1337 2299 days ago

Having worked for many years in a linguistics research lab where we spent a lot of money paying people to edit and align subtitles and audio transcripts, and having largely written what was at the time the most sophisticated subtitle-and-transcript editing tool available, I can confirm: counting characters and mapping them linearly to timespan, even after isolating vocals, does indeed go very poorly. And much worse when there's singing involved.

link

youka 2299 days ago

So let’s play, if you can guess the align method I’ll open source it :)

link

gliese1337 2299 days ago

Alternately, since you say speech recognition isn't "even close", I might try going the other way--doing text-to-speech on the audio stream, attempting to align the two speech tracks, and the back-porting the timecodes from audio alignment onto the text.

But that seems a lot more complicated... so, unlikely.

A way to cheat that would probably work good enough most of the time would be to spectrographic analysis on the audio stream to identify syllables, and then similarly just count syllables in the known text and line those up. That works better the more consistent your spelling system is, though, and still requires language-specific modelling. If you actually want to do a decent job cross-linguistically, you'd need in the general case a dictionary for every supported language listing syllable counts for each word (because not everybody's orthography is transparent enough to make simple models like counting character sequences work).

If you actually have a fully language-agnostic algorithm for aligning text to audio that's actually decently accurate, though, that's gotta be worth at least a Master's degree in computational linguistics, 'cause on the face of it it doesn't seem to me (who has such a Masters degree) that it should even theoretically be possible.

link

youka 2299 days ago

You are close enough, so I have to respect my word. I’m not a genius, just a lego builder, I’ve tried a lot of methods, from DL to ML but aeneas project (with some optimizations) gave me the best results. Amazing project and even better personality. Take a look at https://github.com/readbeyond/aeneas Together with espeak-ng, you can get good results for line level alignment for 108 languages.

link

gliese1337 2299 days ago

The way I'd do it is to use an existing speech recognition system with a large number of language models available (like CMU Sphinx--but probably not CMU Sphinx, 'cause I don't think there are decent openly-available models for 108 different language for Sphinx; maybe MicroSoft's Azure speech to text API or IBM's Watson speech recognition or something like that) to produce a rough transcript with timecodes, and then meet in the middle--use the timecodes from speech recognition, and the known-good text from whatever lyrics you already found, and reduce it to a text-to-text alignment problem so you can match up the ASR timecodes to the known-good text. First pass, I'd probably try an LCS match on the two text streams, but if that wasn't good enough, I'm sure there are better algorithms in the bioinformatics literature.

link

ampdepolymerase 2299 days ago

Speech recognition?

link

youka 2299 days ago

not even close

link

gliese1337 2299 days ago

Examining the source, it looks like alignment is done via an HTML form data submission to 'https://api.audioai.online/split-align'. Manually visiting that website, however, is not very informative... the entire text of http://audioai.online is

  Audio AI API
    Split voice from audio
    Sync voice to text
  contact

link

mycall 2299 days ago

You can use spleeter and align in any audio application.

link

youka 2299 days ago

The question is, how can you do it automatically..

link