Hacker News new | ask | show | jobs
by poslathian 1289 days ago
Super! Makes sense since Skydio is also amazing.

How much data is used for fine tuning? Since spectrograms are (surely?) very out of distribution for the pre training dataset, how much does value does the pre training really bring?

1 comments

To be honest, we're not sure how much value image pre training brings. We have not tried to train from scratch, but it would be interesting.

One thing that's very important though is the language pre-training. The model is able to do some amazing stuff with terms that do not appear in our data set at all. It does this by associating with related words that do appear in the dataset.

Hi, I really admire the skill you put at work on this project. At the same time, I think everyone is overlooking how crucial and problematic the training factor is.

Why was stable diffusion able to generate spectrograms? Because it was fed some. Presumably, those original spectrograms were scraped with little concern over creators' permissions, just like it has been for artists' work in order to produce art-looking image generation. Please, research what has been happening in the art community lately. https://www.youtube.com/watch?v=Nn_w3MnCyDY

A protest on ArtStation has been shown to influence Midjourney's results, proving that huge amounts of proprietary work are constantly scraped without the creators' permission. AIs like these work so well just because they steal and remix real artists' work in the first place. There are going to be legal wars about this.

Stable Diffusion doesn't have an official music generation Ai precisely because it couldn't train it with the same approach without being sued by music labels right away, while isolated artists don't have the same power.

So, back to my question: have you wondered whose work is Stable Diffusion remixing here? Your endeavour is great technically, but as we progress into the future we have to be more aware of the ethical implications that come with different forms of progress.

You could try to base your project on a collection of free-to-use spectograms, and see how it performs. If you do, I think it could actually be very interesting and useful to discuss the results here on Hacker News.

Cheers!

What I would really like to know - what happens if one trains that model from scratch (or is that not possible and training requirements are different? Sry for my ignorance, I never fine-tuned any diffusion model before)?

In my experience (CNN based imagery segmentation) proven architectures (e.g. U-Net) performed similar with or without fine-tuning existing models (that have been mostly trained on imagenet, citiscapes, etc.) IF the domain was rather different.

At least in the field of imagery segmentation there is not much of a point in fine-tuning an off-the-shelf model on let's say medical imagery.

So maybe it's the same for the stable diffusion model. I don't see how some knowledge about the relationship between the prompt and given imagery describing that prompt should help this model map the prompt to a spectrogram of the given prompt.

You can embed images in spectrograms.. might sound weird though