by doctorpangloss 672 days ago
There are no clean image models. Zero. Using today's model architectures, the problem of using non-expressly-permitted data for training is insurmountable. I welcome anyone more knowledgeable on the matter to go ahead and comment about a counterexample before downvoting.

So if the artists prevail, image generators are donezo. Open source, proprietary, whatever. People saying otherwise just don't know enough about how they work.

You have heard of Adobe's Firefly. It is not clean. Adobe uses CLIP, T5, or something for text conditioning. None of those things were trained on expressly permitted content. Go ahead and ask them.

Maybe you have heard of Open Model Initiative. They are going to use CLIP or T5. They have no alternative.

There are not enough license bureau images to train a CLIP model, not enough expressly licensed text content to train T5. A CLIP model needs 2 billion images to perform well, not the 600m Adobe claims they have access to. It's right in the paper.

Good luck training a valuable language model on only expressly permissioned content. You'd become a billionaire if you could keep such an architecture secret. And then when it does exist, such as with some translation models, well they underperform, so who uses them?

What do people want? I don't really care about IP, I care about, who is allowed to make money? Is only Apple, who controls the devices and accounts, and therefore can really enforce anti-piracy, permitted to make money? Only parties with good legal representation? It's not so black and white, not so cut and dried, who the good guys and bad guys are. We already live with a huge glut of content and raised interest rates, which have been 100x more impactful to the bottom line - financial and creative - of working artists. Why aren't these artists demanding that the Fed drops rates, or that back catalog media be delisted to boost demand for new media? It's not that simple either! Presumably a lot of people using these image and video generators are narrative creators of a kind too, like video game developers, music video makers, etc. Are they also bad guys?

There's no broad solution here, the legal victory here is definitely pyrrhic, but one thing's for sure: Apple, NVIDIA, Meta and Google will still be printing cash. The artists are advocating for a position that boils down to, "The only moral creative-economic status quo is my status quo."

14 comments

That CLIP is not data- or sample-efficient is well known, and research to improve this is ongoing. Here is a 2021 paper which outperforms a CLIP baseline with 7x less data: https://arxiv.org/abs/2110.05208 I am sure there are more recent papers too, possibly with larger gains. I do not see why Adobe would not be able to make a good CLIP-like model with 0.6 billion images.
> I do not see why Adobe would not be able to make a good CLIP-like model with 0.6 billion images.

Unity and Epic have tried and failed to do so. There are lots of talented people out there at companies with lots of money. Adobe, Unity and Epic aren't the only ones with licensing bureau images either. And anyway, did you consider that the vast majority of content in licensing bureaus is garbage? Or that the captions are garbage? Or that maybe they have wildly overstated the number of images they have?

Adobe hasn't published anything about their architecture or approach for the simple reason that it is not clean in the way they advertise their models to be.

Where are you getting 2 billion from? The original CLIP paper says:

> We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. [1]

OpenCLIP was trained on more images, but the datasets like LAION-2B are kind of low-quality in terms of labeling; I find it plausible that a better dataset could outperform it. I'm pretty sure that the stock images Adobe is drawing from have better labeling already.

I agree that this is likely to backfire on artists, but part of that is that I expect the outcome to be that large corporations will license private datasets and open research will starve.

[1] https://arxiv.org/abs/2103.00020
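
For readers who want to see what "predicting which caption goes with which image" means in practice, here is a minimal sketch of a symmetric contrastive loss over a tiny batch. The embeddings are hand-written 2-d toy vectors, the function names are hypothetical, and the temperature of 0.07 is only an illustrative value; this is not the paper's code, just the general shape of the objective.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for a single row of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Image i should match caption i and no other caption in the batch."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = scaled cosine similarity between image i and caption j
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: each row should peak on the diagonal.
    loss_images = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image direction: each column should peak on the diagonal.
    loss_texts = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (loss_images + loss_texts) / 2

# Toy batch: captions aligned with their images vs. captions swapped.
matched = contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
swapped = contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
```

The loss is near zero when each caption sits next to its own image in embedding space and large when the pairs are swapped, which is why caption quality matters as much as image count.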

The 400m images in the paper yield the ~40% zero shot ImageNet accuracy in the chart they publish.

That level of performance is generally not good enough for text conditioning of DDIMs.

Later in the paper, and for the published CLIP checkpoints, they talk about performance that is almost twice as good, at 76.2%. That data point, notably, does not appear in the chart. So the published checkpoints, and the performance they talk about later in the paper, are clearly trained on way more data.

How much data? Let's take a guess. I got the data points from the chart they have, and I went and fit y = a log_b(c + dx) + K to the points in the paper:

    a≈12.31
    b≈0.18
    c≈24.16
    d≈0.81
    K≈−10.47
Then I got 7.55b images to get a performance of 76%. The fit is R^2 = 0.993, I don't have any good intuitions for why this is so high, it could very well be real, and there's no reason to anchor on "7.55b is a lot higher than LAION-4b", although they could just concatenate a social media image dataset of 3b images with LAION-4b, and boom, there's 7b.
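
A rough sketch of the extrapolation step described above, for anyone who wants to check it: once a curve of the form y = a log_b(c + dx) + K has been fit, solving for x at a target accuracy is just algebra. The parameter values below are hypothetical placeholders, not the ones quoted, because the comment does not give the units of x or the underlying data points, so this does not try to reproduce the 7.55b figure. Note also that a and b are not independent in this form, since a log_b(z) = (a / ln b) ln z, so very different (a, b) pairs can describe the same curve.

```python
import math

def log_model(x, a, b, c, d, K):
    """Evaluate y = a * log_b(c + d*x) + K."""
    return a * math.log(c + d * x, b) + K

def invert_log_model(y, a, b, c, d, K):
    """Solve y = a * log_b(c + d*x) + K for x."""
    return (b ** ((y - K) / a) - c) / d

# Hypothetical parameters chosen only to illustrate the round trip.
params = dict(a=10.0, b=2.0, c=1.0, d=0.5, K=5.0)

# Dataset size (in whatever units x was fit in) needed to reach y = 76.
x_needed = invert_log_model(76.0, **params)

# Sanity check: plugging x_needed back in should recover the target.
y_check = log_model(x_needed, **params)
```

Because the inverse is exponential in y, small changes in the fitted parameters move the extrapolated dataset size a lot, so any single number from a fit like this should be read as an order-of-magnitude guess.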

OpenCLIP reproduced this work after all with 2b images and got 79.5%. But e.g. Flux and SD3 do not use OpenCLIP's checkpoints. So that one performance figure isn't representative of how bad OpenCLIP's checkpoints are versus how good OpenAI's checkpoints are. It's not straightforward to fit, it's way more than 400m.

Another observation is that there are plenty of Hugging Face spaces with crappy ResNet and crappy small-dataset trained-from-scratch CLIP conditioning to try. Sometimes it actually looks as crappy as Adobe's outputs do, there's a little bit of a chance that Adobe tried and failed to create its own CLIP checkpoint on the crappy amount of data they had.

Asking why the artists are mad at the corporations that are trying to profit off their labor without permission and not the fed or other artists is definitely a take.
You are making a bad faith comment. There's no mystery why artists are mad at Stability and Midjourney. I agree that demanding lower interest rates would be ridiculous. That is my point. You could delete Midjourney, Stability, DALL-E3, etc. tomorrow, and it will still suck harder today to be a working artist than it did in 2021, when interest rates were lower and there were literally hundreds more TV series being produced, 2x more video games being made, than today.

Why limit ourselves to turning back the clock on AI, on interest rates and content productivity, if we're going to play time machine fantasies? You could also go back in time and buy bitcoin, and be rich. I am mocking the idea of turning back the clock, and you know it, and while anyone has a right to be angry about anything, and to engage in a time machine fantasy about anything, it ought to at least be a fantasy that makes sense and achieves some goals.

Because the goal right now, "The smallest, most memetic sentiment of I'll show those corporations!" is kind of well-trodden, kind of old and tired. Brother, there are millions of people trying to do that every day. And when they achieve their goals of showing the big corporations, I cannot think of a single instance where anyone but the already lucky few - like these famous plaintiffs! - gains anything financially.

I appreciate the extent to which you’ve demonstrated whataboutism at its extremes, but I think we can take things even further. Let’s suggest that artists direct their ire at the emergence of life itself from the raw materials of the universe, as that is, indisputably, the origin of all suffering.
> Let’s suggest that artists direct their ire at the emergence of life itself from the raw materials of the universe, as that is, indisputably, the origin of all suffering.

Some artists do.

A keen observation. While artists may be made redundant, I doubt AI will ever achieve the depth of insight you’ve demonstrated in this thread.
> Using today's model architectures, the problem of using non-expressly-permitted data for training is insurmountable… So if the artists prevail, image generators are donezo.

This doesn’t follow. Using 2014’s model architectures, image generators were also impossible, but that didn’t prevent progress. The field is moving absurdly rapidly. Suggesting that because we can’t do it one way today, therefore we can’t do it that way tomorrow is like saying that because we couldn’t do it one way yesterday, therefore we can’t do it that way today.

It’s wild to trample people’s livelihoods because researchers haven’t figured out how not to yet, especially when that kind of research is making such quick progress. I’d rather wait a few years and have the best of both worlds.

> There are not enough license bureau images to train a CLIP model, not enough expressly licensed text content to train T5. A CLIP model needs 2 billion images to perform well, not the 600m Adobe claims they have access to. It's right in the paper.

Not an expert on this, but I wonder:

1) how many images you could create/buy/tag with a billion dollar investment, and

2) if you could lower the training requirements with targeted training data creation (e.g. get low-priced/amateur models to come in singly and in groups for an hour each and work through a catalog of poses/costumes designed to result in a very good generative model for "people").

I'm sure the artists don't give any care about the parts of the training that aren't directly related to generating images, such as models which generate captions for images.
CLIP is just for an embedding for images and text, right?

I might be getting mixed up… The diffusion part is just trained with the images, and the guidance part… is trained to produce the image when given the additional information of the embedding of the text? I find it difficult to imagine how the information from the CLIP embedding of the text could result in much information about the images that CLIP was trained with, ending up in the generated images?

An understanding of language is important for conveying and achieving intent.

Imagine working with an artist in a multi-step refinement process to produce some desired artwork. Regardless of the artist's skill, you'll probably get better results if you're able to communicate well.

That's kinda how the diffusion process works. It starts with noise, generates a rough output, then iteratively refines it. The classifier is part of the refinement process so it knows what to change.

"Hey, you've added a tree-looking-thing on your beach-looking-thing, you should add some palm fronds so it better fits the setting."

> CLIP is just for an embedding for images and text, right?

Yes, which is what makes text-to-image generation possible. You can go ahead and try using Stable Diffusion models, or even the incredibly high quality Flux, with no text "embedding" (or whatever you want to call it), and judge for yourself if those outputs are useful.

I get that, but my question is, “how can using the guidance from CLIP possibly make the resulting image infringe on copyright?”. I’m not saying that the CLIP part isn’t necessary for it to be useful.
The diffusion process is conditioned on CLIP text, which works better (in theory) since the encoded text is aligned with images.
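
To make "conditioned on CLIP text" a bit more concrete: one widely used way the text embedding steers sampling is classifier-free guidance, where the denoiser is run once with the prompt embedding and once without, and the two noise predictions are blended. The thread does not name this technique, so treat it as an assumption about what "the guidance part" refers to. The function name and the toy numbers below are hypothetical; a real model would produce these vectors from a U-net, not from literals.

```python
def guided_noise(eps_uncond, eps_cond, scale):
    """Blend unconditional and text-conditioned noise predictions.

    scale = 0 ignores the prompt entirely, scale = 1 uses the conditioned
    prediction as-is, and scale > 1 pushes further in the prompt's direction.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Toy 2-element "noise predictions" standing in for full image tensors.
eps_uncond = [0.0, 1.0]  # prediction with an empty prompt
eps_cond = [1.0, 1.0]    # prediction with the prompt's text embedding

no_prompt = guided_noise(eps_uncond, eps_cond, 0.0)
plain = guided_noise(eps_uncond, eps_cond, 1.0)
strong = guided_noise(eps_uncond, eps_cond, 7.5)
```

In this picture the text encoder only supplies the direction to push in; what actually gets drawn along that direction comes from the image model that produced the noise predictions, which is the distinction the question above is asking about.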
> There are no clean image models. Zero. Using today's model architectures, the problem of using non-expressly-permitted data for training is insurmountable.

"This would be hard to do while respecting licenses on creative works" is not an argument for being permitted to ignore those licenses.

I don't like copyright, but I strongly believe in everyone following the same rules. If AI companies are finding that copyright is inconvenient: welcome to the club, Open Source developers have been saying that for decades, and others have been saying it for centuries. There shouldn't be a special asymmetric exception for AI training that lets AI ignore licenses while everyone else cannot. By all means remove copyright restrictions for everyone, for all uses.

> So if the artists prevail, image generators are donezo.

And for exactly that reason I hope they prevail. Model training can start over and do it right this time.

It was very surprising OpenAI wasn't named as a defendant in this suit due to CLIP.
The plaintiffs barely understand how any of this stuff works. The judge barely understands how this stuff works.
Imagine OpenAI put all their code and all their work in a public repo so someone can modify it and sell it without permission. Oh wait... they wouldn't do that.

> Presumably a lot of people using these image and video generators are narrative creators of a kind too, like video game developers, music video makers, etc. Are they also bad guys?

Was there a dearth of video games or music videos before generative AI became mainstream? Yeah, creating takes resources and time and effort and dedication, usually for very little reward.

If these companies can't exist without stealing everyone else's work, then maybe they should hire creators with their billions or license the material.

The level of cleanliness you talk about matters for FOSS people like us. The kinds of risks Adobe's Firefly customers might care about might be lower. They probably don't care that the model knows what the text string "C-3PO" means, but absolutely don't want it drawing random bits and pieces of other copyrighted images without being prompted for them.

My understanding was that CLIP handled prompt comprehension - like, there's a set of vectors in CLIP space for "gold humanoid robot" that "C-3PO" would map to from the small language model, and pictures of C-3PO would map to from the image model in CLIP. But the U-net doing the actual image diffusion wouldn't know how to fill that part of CLIP space with the specific copyrightable representation of the Star Wars character unless it'd been trained on the same set of images. It might generalize how to draw a gold robot, which is not a copyrightable image feature, but not C-3PO specifically.

It's entirely plausible that a court might say training CLIP on copyrighted material is OK, but training the VAE or U-net layers is not, based on the technical capability of each layer to reproduce trained-on material.

The moral arguments being bandied about by artists are broader than copyright. Firefly - or even a fully public-domain-trained model - cannot satisfy them. Being trained on is a moral insult, but they would still be insulted by AI bros and corporate stooges boasting about how AI can eliminate entire classes of artistic work. To be clear, the AI models we currently have - as well as those we will have in the future - are not useful tools for artists. The problem is not a lack of training data or the provenance of said data, it's the fact that text is not a good interface for visual artists.

It is, however, a very good interface for people who want artists to go away. What AI art is doing in 2024 is satisficing - i.e. providing viewers and users of art with a good-enough market substitute.

The bigger questions you raise about ownership are orthogonal to the questions of who gets to own the model. The artists opposing AI rightfully want to see tech companies bleed, because tech companies are the same companies who sold their bosses on the tools that steal their wages - e.g. streaming services that pay fractions of a cent if you're lucky. If AI were to prevail the alternative would then be to engage in copyright laundry in protest. e.g. "If you won't protect us against AI, then we'll weaponize it against the media conglomerates who want to use it to fire us with."

Frankly, I’m not convinced that a world in which generative AIs based on unlicensed data have to shut down is a bad thing. You want to create art, you learn to draw or hire someone who can. You want to create a story, you learn to write or hire someone who can.
> So if the artists prevail, image generators are donezo

Good. If it's impossible to make this particular type of image/whatever (it's not art) generator without exploiting all artists, then it shouldn't be allowed to be made.

I once trained a model from data from a simulator which I wrote myself. I think it's clean.

Just sayin, zero is a strong claim.

Yeah, and as much as I may not be a big Adobe fan, they legit hold the rights to plenty of "clean" IP-compliant training material (OP's comment re generative text notwithstanding)
Adobe also trained on output of Midjourney

https://www.cdpinstitute.org/news/adobe-firefly-partly-train....

> (OP's comment re generative text notwithstanding)

That's like saying, "Notwithstanding the part of this that is true, but would be inconvenient to the idea that Adobe has something invaluable."

You can't train a useful text-to-image model without some kind of text conditioning approach. All the existing text conditioning approaches cannot be developed using only the data they have. How else can I put this?

The whole insight here is that the idea of "clean" is already kind of magical, that people want "clean" image models but they don't really understand the meaning of "clean" - or rather, nobody wants to take leadership in educating how these models work. People want good vibes, aesthetically pleasing "clean" image generators, not actually technologically clean image generators.

But this court case would outlaw the good vibes "clean" generators, and since there are no technologically clean image generators, that's it for image generators.