by dangom 1275 days ago
This idea is presented by Jeremy Howard in literally the first class of Deep Learning for Coders (most recent edition). A student wanted to classify sounds but only knew how to do vision, so they converted the sounds to spectrograms, fine-tuned an image model on the labelled spectrograms, and the classification worked pretty well on test data. That of course does not take any merit away from the Riffusion authors.
2 comments

The idea of connecting CV to audio via spectrograms predates Jeremy Howard's course by quite a bit. That's not really the interesting part here though. The interesting part is that a simple extension of an image generation pipeline produces such impressive results with generative audio. It really emphasizes how useful the idea of Stable Diffusion is.

edit: added a bit more to the thought

The idea to apply computer vision algorithms to spectrograms is not new. I don't know who first came up with it, but I first came across it about a decade ago.

I just ran a quick Google Scholar search, and the first result is https://ieeexplore.ieee.org/abstract/document/5672395

This is from 2010. I didn't go looking, but it wouldn't surprise me if the idea is older than that.

In the 90s (and continuing through to today) there were a number of systems for composers built around the workflow of converting a sound to a spectrogram, doing visual processing on the image, and then re-synthesizing the sound from the altered spectrogram. Many were inspired by Xenakis' UPIC system, which was designed around the second half of this workflow: you'd draw the spectrogram with a pen and then synthesize it.

https://en.wikipedia.org/wiki/UPIC
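To make the "draw it, then synthesize it" half concrete, here is a toy additive-synthesis sketch. It is not how UPIC itself worked internally; it only illustrates the general idea of reading a picture as a spectrogram. The function name `synthesize`, the 200 to 2000 Hz range, the 50 ms column length, and the tiny `drawing` grid are all made up for the example.

```python
import math

def synthesize(image, sample_rate=8000, column_seconds=0.05,
               f_low=200.0, f_high=2000.0):
    """Treat each row as a sine oscillator and each column as a time slice.

    Pixel values (0..1) set the oscillator amplitude during that slice.
    """
    rows = len(image)
    cols = len(image[0])
    samples_per_col = round(sample_rate * column_seconds)
    # Top row is the highest frequency, as in a conventional spectrogram plot.
    freqs = [f_high - (f_high - f_low) * i / max(rows - 1, 1)
             for i in range(rows)]
    out = []
    for j in range(cols):
        for n in range(samples_per_col):
            # Global time keeps each oscillator's phase continuous across columns.
            t = (j * samples_per_col + n) / sample_rate
            value = sum(image[i][j] * math.sin(2 * math.pi * freqs[i] * t)
                        for i in range(rows))
            out.append(value / rows)  # Scale so the sum stays within -1..1.
    return out

# A hand-"drawn" 4x6 picture: a line rising from the lowest row to the highest.
drawing = [
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
]
audio = synthesize(drawing)
```

The result is a list of floats in the range -1..1 that steps upward in pitch as the drawn line rises. Amplitudes switch abruptly at column boundaries, so a real tool would smooth between columns to avoid clicks.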

Edit: my favorite of all these systems was Chris Penrose's HyperUPIC, which provided a lot of freedom in configuring how the analysis and synthesis steps worked.