by dimatura 1236 days ago
I think that description applies to Riffusion, one of the earlier models in this area, which was a pretty straightforward adaptation of image-based diffusion models to making music, since you can treat spectrograms as images. But this model uses "SoundStream", which is another model that has its own paper. It's described as a "neural audio codec", which by itself is a model that encodes audio into "tokens" and decodes those tokens back into audio; so sort of like other codecs (e.g., MP3), except that the compressed representation it uses is a higher-level learned representation. This model outputs the tokens, which are then decoded by SoundStream. The tokens probably encode a lot of the same kind of spectral information contained in spectrograms (or similarly, mel-frequency features) but seem to be a little bit more expressive/data efficient.
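To make the "audio in, tokens out, audio back" idea concrete, here's a toy sketch of a codec that maps short frames of samples to the nearest entry in a tiny codebook. This is not SoundStream's actual method or API: the codebook, frame size, and the `encode`/`decode` names are all made up for illustration, and a real neural codec learns its representation rather than using hand-written entries.

```python
# Toy "codec": each token id stands for one short frame of samples.
# Hypothetical, hand-written codebook; a neural codec would learn this.
FRAME = 4
CODEBOOK = [
    [0.0, 0.0, 0.0, 0.0],      # token 0: silence
    [1.0, 1.0, 1.0, 1.0],      # token 1: positive plateau
    [-1.0, -1.0, -1.0, -1.0],  # token 2: negative plateau
    [1.0, -1.0, 1.0, -1.0],    # token 3: fast oscillation
]

def encode(samples):
    """Split samples into frames and map each frame to its nearest codebook index."""
    tokens = []
    for start in range(0, len(samples) - FRAME + 1, FRAME):
        frame = samples[start:start + FRAME]
        # Squared Euclidean distance from this frame to every codebook entry.
        dists = [sum((a - b) ** 2 for a, b in zip(frame, entry)) for entry in CODEBOOK]
        tokens.append(dists.index(min(dists)))
    return tokens

def decode(tokens):
    """Rebuild an approximate waveform by concatenating the codebook frames."""
    out = []
    for t in tokens:
        out.extend(CODEBOOK[t])
    return out

audio = [0.1, -0.1, 0.0, 0.05, 0.9, 1.1, 1.0, 0.8, 0.9, -1.2, 1.1, -0.8]
tokens = encode(audio)           # 12 samples become 3 integer tokens
reconstructed = decode(tokens)   # lossy: close to, but not equal to, the input
```

The point is just that a generative model sitting on top only has to predict the short integer sequence, and the decoder turns that back into a waveform.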