Show HN: Karaoke for any song in any language

  Uncaught Exception:
  Error: Could not get code signature for running application
    at m (/Applications/Youka.app/Contents/Resources/app/.webpack/main/index.js:1:12481)
    at App.<anonymous> (/Applications/Youka.app/Contents/Resources/app/.webpack/main/index.js:1:14365)
    at App.emit (events.js:215:7)

youka 2299 days ago

Just reopen it and you will be fine (I don't have a spare $99/year for an Apple code-signing certificate).

ronyfadel 2299 days ago

I wish the readme had a description of how Youka works. Looks promising, but I’m not sure it does what I think it does.

youka 2299 days ago

I'll add some explanation soon. Here's the main process:

Search your query in YouTube using https://github.com/youkaclub/youka-youtube

Search lyrics using https://github.com/youkaclub/youka-lyrics

Split the vocals from instruments using https://github.com/deezer/spleeter

Align text to voice (the hardest part) using some private api

yorwba 2299 days ago

> Align text to voice (the hardest part) using some private api

That's also the part that would be most interesting to have explained. Is it language-agnostic? After all, the title says "in any language", but I can't think of any text-audio alignment algorithms that don't require a language-specific model. (Unless you just count characters and assume they map linearly to time, which I'd expect to go very badly.)
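
For reference, the naive character-counting baseline mentioned in the parenthesis could look roughly like this. It is only a sketch of that idea, not how Youka aligns lyrics; the function name `linear_align` and the sample words are made up for illustration.

```python
def linear_align(words, start, end):
    """Give each word a time span proportional to its character count."""
    total_chars = sum(len(w) for w in words)
    if total_chars == 0:
        return []
    duration = end - start
    spans = []
    cursor = start
    for word in words:
        span = duration * len(word) / total_chars
        spans.append((word, cursor, cursor + span))
        cursor += span
    return spans


# Spread four words over a 12-second line.
timings = linear_align(["twinkle", "twinkle", "little", "star"], 0.0, 12.0)
```

The weakness is visible in the code itself: nothing looks at the audio, so held notes, pauses, and instrumental breaks all shift every later word.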

gliese1337 2299 days ago

Having worked for many years in a linguistics research lab where we spent a lot of money paying people to edit and align subtitles and audio transcripts, and having largely written what was at the time the most sophisticated subtitle-and-transcript editing tool available, I can confirm: counting characters and mapping them linearly to timespan, even after isolating vocals, does indeed go very poorly. And much worse when there's singing involved.

youka 2299 days ago

So let's play: if you can guess the alignment method, I'll open-source it :)

gliese1337 2299 days ago

Alternately, since you say speech recognition isn't "even close", I might try going the other way--doing text-to-speech on the lyrics, attempting to align the synthesized and recorded speech tracks, and then back-porting the timecodes from the audio alignment onto the text.

But that seems a lot more complicated... so, unlikely.

A way to cheat that would probably work well enough most of the time would be to do spectrographic analysis on the audio stream to identify syllables, and then similarly just count syllables in the known text and line those up. That works better the more consistent your spelling system is, though, and still requires language-specific modelling. If you actually want to do a decent job cross-linguistically, you'd need in the general case a dictionary for every supported language listing syllable counts for each word (because not everybody's orthography is transparent enough to make simple models like counting character sequences work).
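
A rough sketch of that syllable-counting cheat, assuming the syllable onset times have already been detected by some acoustic analysis step that is not shown. The vowel-run heuristic in `count_syllables` and both function names are illustrative assumptions, not anything taken from Youka.

```python
import re


def count_syllables(word):
    """Crude estimate: one syllable per run of vowel letters, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def align_by_syllables(words, onsets):
    """Map each word to the onset time of its first syllable.

    `onsets` is assumed to be a sorted list of syllable start times in
    seconds, produced elsewhere.
    """
    timings = []
    idx = 0
    for word in words:
        if idx >= len(onsets):
            break  # ran out of detected syllables
        timings.append((word, onsets[idx]))
        idx += count_syllables(word)
    return timings


timings = align_by_syllables(["hello", "world", "again"],
                             [0.0, 0.4, 0.9, 1.3, 1.7])
```

The heuristic already shows the orthography problem: it counts three syllables in "machine", which has two, so every word after it would be shifted by one onset.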

If you actually have a fully language-agnostic algorithm for aligning text to audio that's decently accurate, though, that's gotta be worth at least a Master's degree in computational linguistics, 'cause on the face of it it doesn't seem to me (who has such a Master's degree) that it should even theoretically be possible.

gliese1337 2299 days ago

The way I'd do it is to use an existing speech recognition system with a large number of language models available (like CMU Sphinx--but probably not CMU Sphinx, 'cause I don't think there are decent openly-available models for 108 different languages for Sphinx; maybe Microsoft's Azure speech to text API or IBM's Watson speech recognition or something like that) to produce a rough transcript with timecodes, and then meet in the middle--use the timecodes from speech recognition, and the known-good text from whatever lyrics you already found, and reduce it to a text-to-text alignment problem so you can match up the ASR timecodes to the known-good text. First pass, I'd probably try an LCS match on the two text streams, but if that wasn't good enough, I'm sure there are better algorithms in the bioinformatics literature.
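
A minimal sketch of that first-pass LCS match, assuming the ASR step yields `(word, start_time)` pairs. The function names and the sample data are hypothetical, and this is one way to do the text-to-text step, not a description of what Youka does.

```python
def lcs_pairs(a, b):
    """Return index pairs (i, j) of one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # table[i][j] holds the LCS length of a[i:] and b[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the start to recover the matched positions.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def transfer_timecodes(lyrics, asr):
    """Copy ASR start times onto matching lyric words; unmatched words get None."""
    lyric_words = [w.lower() for w in lyrics]
    asr_words = [w.lower() for w, _ in asr]
    times = [None] * len(lyrics)
    for li, ai in lcs_pairs(lyric_words, asr_words):
        times[li] = asr[ai][1]
    return list(zip(lyrics, times))


lyrics = ["row", "row", "row", "your", "boat"]
asr = [("row", 0.0), ("row", 0.5), ("your", 1.4), ("boot", 1.9)]
result = transfer_timecodes(lyrics, asr)
```

Words the recognizer dropped or misheard come back with `None`, so a later pass would still have to interpolate between the nearest matched neighbours.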

ampdepolymerase 2299 days ago

Speech recognition?

gliese1337 2299 days ago

Examining the source, it looks like alignment is done via an HTML form data submission to 'https://api.audioai.online/split-align'. Manually visiting that website, however, is not very informative... the entire text of http://audioai.online is

  Audio AI API
    Split voice from audio
    Sync voice to text
  contact

mycall 2299 days ago

You can use spleeter and align in any audio application.

youka 2299 days ago

The question is, how can you do it automatically?

Reubend 2299 days ago

Hey there! First of all, I want to tell you that the app is fantastic. I used the earlier version of this, when it was a website, from your previous HN post. And once again the alignment works quite well in my experience, as does the isolation.

In the future, it would be great to have a "portable" version of this for Windows that doesn't install anything. It's annoying to open up an app, and have it install itself without any warning or user consent. You could just release a .zip file with the build as an option.

youka 2299 days ago

I considered a few options for installing ffmpeg and chose this one. I'm open to other suggestions.

Reubend 2299 days ago

You can distribute a .zip file which includes a statically linked build of FFmpeg: https://ffmpeg.zeranoe.com/builds/ . Then just call it locally. There's no need to install it system-wide.

youka 2299 days ago

I don't install it system-wide; I just download a single binary into the Youka directory.
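
Calling a bundled binary from the app directory, rather than one on the system PATH, might look something like the sketch below. The directory layout and the helper names are assumptions for illustration; only the ffmpeg options (`-y`, `-i`, `-vn`) are standard ones.

```python
import subprocess
from pathlib import Path


def ffmpeg_command(app_dir, src, dst):
    """Build the argument list for the ffmpeg binary stored in app_dir."""
    # Hypothetical location; on Windows the file would be ffmpeg.exe.
    binary = Path(app_dir) / "ffmpeg"
    # -y overwrites the output, -i names the input, -vn drops the video stream.
    return [str(binary), "-y", "-i", str(src), "-vn", str(dst)]


def extract_audio(app_dir, src, dst):
    """Run the bundled ffmpeg and raise if it exits with an error."""
    subprocess.run(ffmpeg_command(app_dir, src, dst), check=True)


command = ffmpeg_command("youka", "song.mp4", "song.wav")
```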

rnotaro 2299 days ago

I get an error when trying to open any video:

Ooops, some error occurred :( Error: [Errno 2] No such file or directory: '/tmp/tmpphtr8ehu/accompaniment.aac'

When running on the official Windows 10 SandBox (https://techcommunity.microsoft.com/t5/windows-kernel-intern...)

Edit: it somehow works for some songs. The concept is really nice. I love it.

youka 2299 days ago

Looks like a server-side bug (it can't really handle more than a single split process concurrently). I'll add a queue in the next version.
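
One simple shape such a queue could take is a single worker thread that runs split jobs strictly one at a time. This is a sketch under that assumption, with a stand-in handler instead of a real split call; it is not the server's actual code.

```python
import queue
import threading


def start_worker(jobs, handler, results):
    """Run jobs from the queue one at a time on a single background thread."""
    def loop():
        while True:
            job = jobs.get()
            if job is None:  # sentinel: stop the worker
                jobs.task_done()
                break
            try:
                results.append(handler(job))
            finally:
                jobs.task_done()

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread


jobs = queue.Queue()
results = []
# The lambda stands in for the real (and much slower) split step.
worker = start_worker(jobs, lambda name: f"split:{name}", results)
for name in ["song-a", "song-b", "song-c"]:
    jobs.put(name)
jobs.put(None)
jobs.join()  # block until every queued job has been processed
```

Because only one thread ever calls the handler, two requests can never run the split step at the same time, at the cost of later requests waiting in line.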

yunusabd 2299 days ago

Personally I love karaoke, but looking at the repo and the website gave me no information whatsoever about this project. Maybe that's something you can work on? In the meantime I found this article, which is quite positive: https://www.theverge.com/tldr/2020/2/19/21144452/youtube-you...

rnotaro 2299 days ago

Demo: https://peertube.co.uk/videos/watch/3c183b56-deb6-4e6b-a7a2-...

youka 2299 days ago

You're right! I'll add an illustrative GIF soon.

yunusabd 2299 days ago

Cool! So you were originally running it as a webapp, and then decided to open source it? Presumably due to legal reasons?

youka 2299 days ago

exactly

peterburkimsher 2299 days ago

Is there a way to manually provide the lyrics? I have a substantial collection of songs in Chinese and Taiwanese, and it would be really helpful to use this to help me make lyrics videos for Pingtype. When I tried, I got this error:

Ooops, some error occurred :( Error: name 'espeakng_supported_langs' is not defined

I'll look into aeneas to see if that can give the API-level technical tools that I need - thank you for explaining that part in the other comments!

yorwba 2299 days ago

Note that it won't work for Taiwanese (I assume Hokkien) unless you add the necessary support to espeak-ng.

If your lyrics are in Peh-oe-ji, you'll need to define how the romanization maps to phonemes. You may be able to get some inspiration for that from the definitions for Mandarin and Cantonese. Though I just looked at the "phonology" section on Wikipedia https://en.wikipedia.org/wiki/Taiwanese_Hokkien#Phonology and the tone sandhi rules look a lot more complex than any other Sinitic language I know.

If the lyrics use Chinese characters, there's the added difficulty of collecting a pronunciation dictionary, which I'd probably do by scraping https://twblg.dict.edu.tw/holodict_new/index.html , http://xiaoxue.iis.sinica.edu.tw/ccr/ and Wiktionary. (If you know any other sources for pronunciation data, I'm interested.)

peterburkimsher 2298 days ago

Yes, I know about romanisation! I wrote Pingtype, and extracted romanisation dictionaries for Taiwanese Hokkien and Hakka by parsing Bible data.

https://pingtype.github.io

Tones are difficult, so I encode those as colours. Adding code to espeak-ng sounds very difficult. Most of the songs are in Mandarin though, so I'll try those first.

redraw 2293 days ago

oh, I had the same idea and started working here https://github.com/redraw/karaoke-machine days after Deezer's spleeter was released, but stopped while searching for a way to sync the lyrics. thx! I'll try it out

youka 2292 days ago

good luck! here's the relevant code https://github.com/youkaclub/youka-api/blob/master/youka/ali...

fareesh 2299 days ago

From what I understand, it is software for you to align lyrics to music contained in a video, with tools to enable you to do so.

youka 2299 days ago

Youka aligns the lyrics automatically; there's nothing left for you to do.

fareesh 2299 days ago

Thanks - that sounds great
