by fartsucker69 990 days ago
that's not true. they only ban AI if you don't clearly have usage rights for the data that the AI was trained on. obviously, this includes basically all of the big and popular AI APIs available at the moment.

that sucks for using AI right now, but it's only a question of time until you get huge models trained on CC0 data imo

2 comments

This is functionally identical to “This is true, because Valve bans all existing means of generating AI content for games.”

As for your dream of CC0 models, it’s a dream because you’d have to be asleep to believe it. I don’t mean to phrase it so harshly, but there are so many reasons that can’t work. The main one is that there isn’t enough non-copyrighted data to train any competitive model. The competitive models have only recently gotten good enough to just barely be usable, and far more than 90% of their training data was sourced from people they certainly didn’t get usage rights from.

I deleted a paragraph ranting about usage rights. Suffice it to say, Stallman’s “Right to Read” becomes more prescient with each passing day.

I'd like to point out that it has been shown that text models can be trained on purely synthetic data and perform at or above the level of models trained on human-derived data. This works because you can use an LLM to judge the quality of a particular generated sample, which lets you automate the process of picking high-quality generations. It won't be long before this is done with generative art as well: a multi-modal model could be used to curate the output of some CC0-derived model and build up a much larger training set for a new model.

You could also procedurally create training data by rendering images from 3D scenes with various shaders applied to give them the look of different art styles. You could use neural style transfer instead of, or in addition to, a shader to add more styles of images. You could use the multi-modal model to judge these images as well, selecting only the best. With that, you essentially have a fully automated pipeline for producing a training set of any size that is 100% synthetic except for the base 3D assets, shaders, and example style images, which you could source as CC0 or buy a license to.

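For what it's worth, here is a rough sketch of the generate-then-judge curation loop described above. It is only an illustration under stated assumptions: `generate` and `judge` are hypothetical callables standing in for a generative model and an LLM or multi-modal judge, and `toy_generate` / `toy_judge` are made-up stubs so the example runs without any real model. None of these names come from an actual library.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    prompt: str
    text: str
    score: float


def curate(
    prompts: List[str],
    generate: Callable[[str], str],      # hypothetical: prompt -> generated sample
    judge: Callable[[str, str], float],  # hypothetical: (prompt, sample) -> quality score
    candidates_per_prompt: int = 4,
    min_score: float = 0.7,
) -> List[Sample]:
    """Generate several candidates per prompt and keep the best-scoring one,
    but only if the judge rates it at or above min_score."""
    kept = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(candidates_per_prompt)]
        scored = [Sample(prompt, text, judge(prompt, text)) for text in candidates]
        best = max(scored, key=lambda s: s.score)
        if best.score >= min_score:
            kept.append(best)
    return kept


def toy_generate(prompt: str, rng: random.Random) -> str:
    # Stub "model": appends four randomly chosen words to the prompt.
    words = ["castle", "forest", "river", "castle"]
    return prompt + ": " + " ".join(rng.choice(words) for _ in range(4))


def toy_judge(prompt: str, text: str) -> float:
    # Stub "judge": rewards samples with fewer repeated words (range 0..1).
    words = text.split()
    return len(set(words)) / len(words)


if __name__ == "__main__":
    rng = random.Random(0)
    dataset = curate(
        ["a painting of", "a sketch of"],
        generate=lambda p: toy_generate(p, rng),
        judge=toy_judge,
    )
    for sample in dataset:
        print(f"{sample.score:.2f}  {sample.text}")
```

In a real pipeline the stubs would be replaced by actual model calls, and the kept samples would become the training set for the next model. The threshold and the number of candidates per prompt are arbitrary knobs here, not recommended values.
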
I more or less agree with you (I'm not convinced that training models on the imagery of the internet isn't fair use), but I wouldn't rule out a CC0 model just yet.

There's Mitsua Diffusion One [0], which doesn't produce incredible results, but it's a start and they're planning on adding more data, including opt-in work from artists.

PIXART-alpha [1] was trained on only 25 million images, and has excellent and competitive results. This could pair well with Fondant AI's 25 million Creative Commons-only dataset [2] (not all CC0, but a sizeable amount).

I don't think it's as far away as you think it is!

[0]: https://huggingface.co/Mitsua/mitsua-diffusion-one

[1]: https://pixart-alpha.github.io/

[2]: https://huggingface.co/datasets/fondant-ai/fondant-cc-25m

Adobe's generative AI features all use appropriately licensed training data.

And how are you going to prove you were using a model trained only on CC0 data?