Hacker News
by nonbirithm 1225 days ago
This is also what happens when you overtrain a model. Recent developments allow small, partial sets of model weights called LoRAs to be added on top of a diffusion model. A LoRA can be trained independently of the base model in under half an hour. If you set the learning rate too high, it starts reproducing the source material with extremely high fidelity. That is what overfitting does.
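To make "partial sets of weights added to the model" concrete, here is a minimal sketch of the low-rank idea behind a LoRA in plain Python. Everything in it is an assumption for illustration: the 4x4 base matrix, the rank-1 factors `lora_b` and `lora_a`, the helper names, and the `scale` knob (which is a merge strength, not the learning rate). Real tooling works on much larger tensors, but the arithmetic is roughly this shape.

```python
# Minimal LoRA-style sketch in plain Python (no ML framework).
# All names, sizes, and values are hypothetical and chosen for readability.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def apply_lora(base, lora_b, lora_a, scale=1.0):
    """Return base + scale * (lora_b @ lora_a) without modifying base."""
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(base, delta)]

# A 4x4 "frozen" base weight matrix (identity, so changes are easy to spot).
base = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

# Rank-1 adapter: B is 4x1 and A is 1x4, so 8 trainable numbers instead of 16.
lora_b = [[0.5], [0.0], [0.0], [0.0]]
lora_a = [[0.0, 2.0, 0.0, 0.0]]

merged = apply_lora(base, lora_b, lora_a, scale=1.0)

adapter_params = len(lora_b) * len(lora_b[0]) + len(lora_a) * len(lora_a[0])
base_params = len(base) * len(base[0])
print(adapter_params, "adapter values vs", base_params, "base values")
```

The base matrix is never edited; the adapter is a separate, smaller object that gets added in, which is why it can be trained and shared on its own.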

My conclusion is that there is an argument to be made for infringement in some cases, but it's a matter of degree rather than absolutes. If infringement is defined as "copyrighted works were used in this dataset", then at a certain point (a low enough learning rate) it becomes impossible to tell whether infringing data was used. You'd be working with weight changes so minuscule they could be rounding errors, yet by that definition the model would still be infringing.
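As a rough illustration of the "rounding error" point, the sketch below assumes weights are stored as 32-bit floats and shows an update small enough to vanish entirely when stored. The update sizes (`tiny_update`, `large_update`) and the helper `to_float32` are made up for the example and do not come from any real training run.

```python
import struct

def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]

base_weight = 1.0
tiny_update = 1e-9    # hypothetical contribution from one training example
large_update = 1e-3   # hypothetical contribution that is clearly visible

# The tiny update is below float32 resolution near 1.0, so it disappears.
unchanged = to_float32(base_weight + tiny_update) == to_float32(base_weight)

# The larger update survives storage and changes the weight.
changed = to_float32(base_weight + large_update) != to_float32(base_weight)

print(unchanged, changed)
```

In this toy case the stored weight is bit-for-bit identical whether or not the tiny update was applied, so inspecting the numbers alone cannot reveal that the data was ever used.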

And since arbitrary data can be paired with any set of keywords, the standard for what constitutes "infringing" changes with each model. It would probably be hard to build a benchmark test that can definitively state "this model violates copyright." Any number of keywords can be trained on to obfuscate the prompt needed to reproduce the data, assuming the LR (learning rate) was even high enough for the data to be reproduced closely.

I'm unsure if there can ever be one standard for when a bunch of floating-point numbers crosses the threshold into infringement. That would be applying an absolute standard to a fuzzy algorithm. It's like compressing a JPEG: at some level of compression, a picture of Mickey Mouse becomes unintelligible. But with JPEGs, an unintelligible picture of Mickey Mouse isn't really useful. A LoRA whose weights are underfit just enough that the diffusion gives novel outputs, on the other hand, can be extremely useful.
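The compression analogy can be sketched with plain uniform quantization. This is only a stand-in for the lossy step: real JPEG applies a block transform and quantization tables, none of which is modeled here, and the pixel values and step sizes are invented. The point is just that the error grows by degrees, with no built-in line where the image stops being the original.

```python
def quantize(values, step):
    """Snap each value to the nearest multiple of step (uniform quantization)."""
    return [round(v / step) * step for v in values]

def mean_abs_error(original, approx):
    """Average absolute difference between two equal-length sequences."""
    return sum(abs(o - a) for o, a in zip(original, approx)) / len(original)

# A made-up row of 8-bit pixel intensities.
pixels = [12, 40, 41, 90, 130, 131, 200, 255]

# Coarser steps play the role of heavier compression.
errors = {step: mean_abs_error(pixels, quantize(pixels, step))
          for step in (1, 16, 64, 256)}

for step, err in errors.items():
    levels = len(set(quantize(pixels, step)))
    print(f"step={step:3d}  distinct levels={levels}  mean abs error={err}")
```

At the coarsest step only two distinct levels remain, yet nothing in the arithmetic marks the step at which the row became "unintelligible"; any such threshold has to be chosen from outside.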