by prvc 1251 days ago
>Four-part chorales are presented to the network as 4-channel images. As in Stable Diffusion, a U-Net is trained to predict the noise residual.

>After training the generative model we add 12 channels to the inputs, with the middle four channels representing a mask, and the last four channels are masked chorales. We mask the four channels individually, as opposed to Stable Diffusion Inpainting that use a one-channel mask.

How were they encoded, specifically? Anyway, it's fairly easy to break: try, for example, "c'4 c'#4 d'4 d'#4 e'4 f'4 f'#4" as the melody.

1 comment

There was a typo in the readme, thanks for pointing this out! I add 8 channels (4 mask + 4 masked chorale), not 12. The chorales are transformed into 4-channel arrays, with each channel representing one part of the piece. I've added some example plots to the readme to illustrate.
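
For readers trying to picture the channel layout described above, here is a rough sketch of how the inpainting input could be assembled: 4 noisy chorale channels plus the 8 added ones (4 per-voice mask channels and 4 masked-chorale channels), for 12 in total. This is not the author's code. The image size, the channel order, the mask convention (1 = regenerate), and the helper name `build_inpainting_input` are all assumptions made for illustration, and the actual note encoding is whatever the readme plots show.

```python
import numpy as np

# Hypothetical sizes: the thread does not state the real image dimensions.
N_VOICES, HEIGHT, WIDTH = 4, 64, 64

def build_inpainting_input(noisy, chorale, mask):
    """Stack noisy chorale, per-voice mask, and masked chorale on the channel axis.

    All arguments have shape (4, HEIGHT, WIDTH). The mask is assumed to be 1
    where a region should be regenerated and 0 where the original is kept.
    """
    assert noisy.shape == chorale.shape == mask.shape
    masked_chorale = chorale * (1.0 - mask)  # zero out the regions to inpaint
    return np.concatenate([noisy, mask, masked_chorale], axis=0)

rng = np.random.default_rng(0)
chorale = rng.random((N_VOICES, HEIGHT, WIDTH)).astype(np.float32)  # stand-in data
noisy = rng.standard_normal((N_VOICES, HEIGHT, WIDTH)).astype(np.float32)

# One mask channel per voice: mask only channel 0 (assumed to be one voice)
# over the second half of the last axis, and leave the other voices intact.
mask = np.zeros_like(chorale)
mask[0, :, WIDTH // 2:] = 1.0

x = build_inpainting_input(noisy, chorale, mask)
print(x.shape)  # (12, 64, 64)
```

Because each voice has its own mask channel, one part can be regenerated while the other three stay fixed, which a single shared mask channel (as in Stable Diffusion Inpainting) cannot express.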