by prvc
1251 days ago
>Four-part chorales are presented to the network as 4-channel images. As in Stable Diffusion, a U-Net is trained to predict the noise residual.

>After training the generative model we add 12 channels to the inputs, with the middle four channels representing a mask, and the last four channels are masked chorales. We mask the four channels individually, as opposed to Stable Diffusion Inpainting that use a one-channel mask.

How were they encoded, specifically?

Anyway, it's fairly easy to break: try, say, "c'4 c'#4 d'4 d'#4 e'4 f'4 f'#4" as the melody.
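
For what it's worth, here is a rough sketch of how I read the quoted channel layout, taking it to mean a 12-channel input in total: four noisy channels, then four per-voice mask channels, then the four masked chorale channels. Everything specific here is my assumption, not the authors' code: the function name `build_inpainting_input`, the pitch-by-time grid size, and the convention that a mask value of 1 marks a region to regenerate. The actual encoding of the chorales is exactly what the quote leaves open.

```python
import numpy as np

def build_inpainting_input(noisy, chorale, mask):
    """Stack noisy sample, per-voice mask, and masked chorale into 12 channels.

    All three arguments have shape (4, H, W), one channel per voice.
    Assumed convention: mask is 1 where a voice should be regenerated
    and 0 where the original chorale is kept.
    """
    if not (noisy.shape == chorale.shape == mask.shape and noisy.shape[0] == 4):
        raise ValueError("expected three arrays of shape (4, H, W)")
    masked = chorale * (1.0 - mask)  # zero out the regions to regenerate
    return np.concatenate([noisy, mask, masked], axis=0)

rng = np.random.default_rng(0)
H, W = 8, 16  # hypothetical pitch and time dimensions
chorale = rng.random((4, H, W)).astype(np.float32)
noisy = rng.standard_normal((4, H, W)).astype(np.float32)

mask = np.zeros((4, H, W), dtype=np.float32)
mask[1:] = 1.0  # keep voice 0 (the melody), regenerate the other three voices

x = build_inpainting_input(noisy, chorale, mask)
print(x.shape)  # (12, 8, 16)
```

With a per-voice mask like this, harmonizing a given melody amounts to keeping one channel fixed and masking the other three entirely, which a single shared mask channel could not express.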