by vessenes 476 days ago
In my opinion this is interesting primarily for its architecture. The output songs have unusual, noticeable artifacts, and I would guess they become more noticeable the more you listen.

That said, wow. An end-to-end FAST architecture that can infer a 4.5-minute song in 10 seconds is a compelling thing. I didn’t see whether we got open weights, but my guess is that this is not crazy challenging to train, and some v2/v3 versions of this are likely to be good-to-very-good.

1 comments

The huge missing piece is direction. Songs are way more than just a 10-second style reference and lyrics. Even the most generic pop song from the 90s had recognizable choruses, some repeated bars, and some ebb and flow that connected to the lyrics and made the song interesting to the human ear. Right now the generated songs are, as you noted, somewhat glitchy lyrics over a bland backing track that just sort of goes at one speed and one note for the whole of the lyrics.