HN Mirror

	by mwcampbell 2245 days ago
	> In addition to conditioning on artist and genre, we can provide more context at training time by conditioning the model on the lyrics for a song. A significant challenge is the lack of a well-aligned dataset: we only have lyrics at a song level without alignment to the music, and thus for a given chunk of audio we don’t know precisely which portion of the lyrics (if any) appear. We also may have song versions that don’t match the lyric versions, as might occur if a given song is performed by several different artists in slightly different ways. Additionally, singers frequently repeat phrases, or otherwise vary the lyrics, in ways that are not always captured in the written lyrics.

	I wonder if karaoke videos would be a useful source of data here. Granted, karaoke tracks are usually covers, but some of them are very faithful to the original.