| HN Mirror



	by Lucasoato 1236 days ago
	Does anyone know if these models can also output MIDI instead of plain audio?

3 comments

albertzeyer 1236 days ago

This model is designed to output raw audio.

However, there are many models that do output MIDI. That's actually much simpler, and was already done a few years ago.

I thought OpenAI did this. But then, I might misremember, because their Jukebox actually also seems to produce raw audio (https://openai.com/blog/jukebox/).

Edit: Ah, it was even earlier, OpenAI MuseNet, this: https://openai.com/blog/musenet/

However, midi generation is so easy, you even find it in some tutorials: https://www.tensorflow.org/tutorials/audio/music_generation


kolinko 1236 days ago

Not out of the box, afaik. They produce spectrograms that get converted into wav/mp3.


dimatura 1236 days ago

I think that description applies to Riffusion, one of the earlier models in this area, which was a pretty straightforward adaptation of image-based diffusion models to making music, since you can treat spectrograms as images. But this model uses SoundStream, which is another model that has its own paper. It's described as a "neural audio codec": by itself, a model that encodes audio into "tokens" and decodes those tokens back into audio. So it's sort of like other codecs (e.g., MP3), except that the compressed representation it uses is a higher-level learned representation. This model outputs the tokens, which are then decoded by SoundStream. The tokens probably encode a lot of the same kind of spectral information contained in spectrograms (or similarly, mel-frequency features) but seem to be a little more expressive/data-efficient.
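To make the "audio in, tokens out, audio back" idea concrete, here is a toy sketch of a codec built on a nearest-neighbour codebook lookup. This is not SoundStream itself: the codebook values, the 4-sample frame size, and the names `CODEBOOK`, `encode`, and `decode` are all made up for illustration, and a real neural codec learns its representation rather than using a hand-written table.

```python
# Hypothetical toy codebook: each token id stands for one 4-sample frame.
# A real neural codec learns these entries; the values here are invented.
CODEBOOK = [
    [0.0, 0.0, 0.0, 0.0],      # token 0: silence
    [0.5, 1.0, 0.5, 0.0],      # token 1: positive bump
    [-0.5, -1.0, -0.5, 0.0],   # token 2: negative bump
    [1.0, 1.0, 1.0, 1.0],      # token 3: constant high
]
FRAME = 4  # samples per frame (assumption for this sketch)


def encode(samples):
    """Map each full frame of samples to the id of its nearest codebook entry."""
    tokens = []
    for start in range(0, len(samples) - FRAME + 1, FRAME):
        frame = samples[start:start + FRAME]
        # Squared Euclidean distance from this frame to every codebook entry.
        dists = [
            sum((a - b) ** 2 for a, b in zip(frame, entry))
            for entry in CODEBOOK
        ]
        tokens.append(dists.index(min(dists)))
    return tokens


def decode(tokens):
    """Rebuild an approximate waveform by concatenating codebook frames."""
    out = []
    for token in tokens:
        out.extend(CODEBOOK[token])
    return out


audio = [0.4, 0.9, 0.6, 0.1, -0.4, -1.1, -0.5, 0.0, 0.0, 0.1, 0.0, -0.1]
tokens = encode(audio)
print(tokens)           # one small integer per frame
print(decode(tokens))   # lossy reconstruction of the waveform
```

The point of the sketch is only that a generative model can predict the short token sequence instead of the raw samples, and a separate decoder turns tokens back into sound.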


wokwokwok 1236 days ago

No. They can’t.

You could train a model that could, but these models can’t.

Paper: https://google-research.github.io/seanet/musiclm/examples/

Quote: “By relying on pretrained and frozen MuLan, we need audio-only data for training the other components of MusicLM. We train SoundStream and w2v-BERT on the Free Music Archive (FMA) dataset (Defferrard et al., 2017), whereas the tokenizers and the autoregressive models for the semantic and acoustic modeling stages are trained on a dataset containing five million audio clips, amounting to 280k hours of music at 24 kHz.”

Tldr: you can only get out of these models what you put in, and these ones are trained on raw audio.

If you want midi output, you need to train a model on midi data.
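As a rough illustration of how different the two kinds of data are, the sketch below renders a few symbolic note events into raw samples at 24 kHz. The `(note, start, duration)` tuple layout and the function names are assumptions made for this example, not a real MIDI file format or library API; only the note-to-frequency formula (A4 = MIDI note 69 = 440 Hz) is standard.

```python
import math

SAMPLE_RATE = 24_000  # Hz, matching the rate quoted above


def midi_note_to_hz(note):
    """Standard equal-temperament mapping: MIDI note 69 is A4 at 440 Hz."""
    return 440.0 * 2 ** ((note - 69) / 12)


def render_sine(note, seconds, sample_rate=SAMPLE_RATE):
    """Render one note as a plain sine wave (a stand-in for a real synth)."""
    freq = midi_note_to_hz(note)
    n = int(seconds * sample_rate)
    return [math.sin(2 * math.pi * freq * i / sample_rate) for i in range(n)]


# Symbolic, MIDI-like data: (note number, start in seconds, duration in seconds).
# This tuple layout is hypothetical and only for illustration.
melody = [(69, 0.0, 0.5), (71, 0.5, 0.5), (72, 1.0, 1.0)]

# Raw audio: one float per sample, with no explicit notion of "notes".
audio = []
for note, _start, duration in melody:
    audio.extend(render_sine(note, duration))

print(len(melody), "note events ->", len(audio), "audio samples")
```

Going from notes to samples is a simple rendering step; going the other way means recovering notes from the waveform, which is a separate transcription problem that a model trained only on raw audio was never asked to solve.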
