by jelled 1122 days ago
I have no idea how the courts will end up resolving this legally. In the meantime I've been thinking a lot about the morality of this alleged theft.

Most of us would agree that duplicating and selling someone's book is immoral. Similarly I think we'd all agree that reading multiple books to learn about a topic and then writing your own is perfectly fine.

So where does an LLM trained on millions of books fall? Personally, I don't find it immoral but I know others will disagree. I'd be curious to hear arguments for the immorality of LLMs trained on copyrighted works.

3 comments

It's similar to many other dilemmas with new tech.

When you are out in public, you don't have an expectation of privacy. People can see you, and they can take photos of you. The worker at the cafe will probably remember you and your order. This is fine. But when tech does the exact same thing at scale, so that everywhere you go, everything you buy, etc. is tracked and analyzed, it becomes morally questionable despite being legal.

That's how generative AI is to me. It's doing something people have been doing themselves forever, but now it's doing it faster and more easily than ever before, which changes the equation. The arguments of "it's not real creativity" are a coping mechanism. We are upset that something that was previously locked behind years of learning and hours of effort is now trivially accessible to anyone with a computer.

The timescales matter as well.

You or I could perhaps, if we dedicated a few years to it, make a convincing fake video of an acquaintance of ours doing or saying something. By the end of those few years it'd likely be out of date or maybe just irrelevant. And they'd have to have really, really upset us for us to put in that much effort. It would literally cost us tens or hundreds of thousands or more in opportunity cost.

Contrast that with some sort of tool that's not too far advanced from current image/video generators that can just do the same in a minute by typing "A video of my next door neighbour accepting cash in a briefcase from a man in a suit".

I like this example and I'd agree that tracking people at scale can be immoral depending on the circumstances. But to me it feels that way because something is being taken from you. You've lost your agency to be anonymous or to leave your past behind.

But I don't see how an LLM training on your works deprives you of something you had before.

> deprives you of something you had before.

It does in some sense: your exclusive knowledge of the subject matter is now transferable via an LLM or some other sort of AI model.

For a human to achieve the same, they would have needed to undertake similar amounts of training, effort and dedication as you did. The number of people who would do so is currently small.

So realistically, your value as someone who has this unique expert subject knowledge is diminished.

However, these individual losses are offset by the greater good that the LLM/AI models would generate. It is exactly equivalent to the Luddites' arguments about why they did not want the textile machines to replace them.

Artists put their artwork online and let people use it within an acceptable range. Usually, learning from it (not copying it) is acceptable. But there is more controversy around generating millions of artworks in the same personal style, which costs artists their jobs and lets their families starve.

Setting the precedent that training from materials = theft seems pretty scary to me. First, because it redefines learning as stealing, and second, because it does so without proving that the authors of the source material are in some way deprived of something, in a way that is any different from a human learning from their materials and producing content with that knowledge.

Let's say the AI was used to generate illegal content. If these words/images are truly non-transformative and still the property of those whose work the model was trained on, this would be a pretty grim scenario. It seems much more reasonable that the person who prompts the system to build such content would be responsible, and thus the true owner of the output.

For this discussion it's useful to keep in mind that ChatGPT and other AI tools don't spontaneously create content; they create it in response to a human "query". It's also the human who decides whether or not the material is useful and suitable (as it is often not accurate, truthful or useful).

From here it seems more like a discussion about plagiarism and copyright, but both of these fall outside the scope of the article. I feel authors haven't taken this angle because the end materials are reasonably different from the sources (notwithstanding memorisation effects).

I do agree with the sentiment that ChatGPT isn't intelligent (but AI has never claimed to reproduce true intelligence). I prefer the tongue-in-cheek description of "spicy autocorrect" as a fairer representation of its capability.

I was thinking about this further, as there's a lot of grey area around the idea of who owns the words. After all, everyone is using the same words, just in different combinations. But what about words which aren't in the ordinary lexicon, specifically unique trademarks? These are words that are entirely unique, e.g. "kleenex" and so on, and they are traceable to the trademark owner.

By the standard that training from materials = theft, any reference to one of these unique trademarks would be interesting and highly problematic. AI wouldn't be allowed to write any kind of non-editorial text that uses unique trademarked product names without it being criminal.

Instead of debating the moral and legal aspects, it could be more productive to focus on the technical aspects, which can help in making more informed decisions about the moral and legal problems.

On some levels of abstraction, LLMs seem to be unknowable black boxes. On other levels, they are simply approximate solutions to established problems, such as estimating the probability distribution for a token following a context.
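To make the "established problem" concrete, here is a minimal sketch of estimating the next-token distribution by counting, n-gram style. It is not how an LLM does it (an LLM learns the estimate with a neural network); the function name, the toy corpus and the whitespace tokenization are all made up for illustration.

```python
from collections import Counter, defaultdict

def next_token_distribution(tokens, context_len=2):
    """Estimate P(next token | previous context_len tokens) by counting.

    Hypothetical helper for illustration: a plain n-gram estimate,
    not an actual LLM.
    """
    counts = defaultdict(Counter)
    for i in range(len(tokens) - context_len):
        context = tuple(tokens[i:i + context_len])
        counts[context][tokens[i + context_len]] += 1
    # Normalize the raw counts into probabilities per context.
    return {
        ctx: {tok: n / sum(c.values()) for tok, n in c.items()}
        for ctx, c in counts.items()
    }

# Toy corpus (an assumption), tokenized on whitespace.
corpus = "the cat sat on the mat and the cat ran".split()
dist = next_token_distribution(corpus, context_len=2)

print(dist[("the", "cat")])  # two equally likely continuations
print(dist[("on", "the")])   # a single continuation
```

A context with a single continuation is the "only one way forward" case; a context with several is where the model has to choose.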

Genome assembly is kind of like a complementary problem to text generation. You can't read the genome directly, but you can duplicate it, break it into fragments, read the fragments, and try to assemble them. The methods vary, but you generally try to find overlaps between the fragments and build a graph based on the overlaps. If you start from a context that occurs only once in the genome, there is often one overwhelmingly likely path in the graph that corresponds to a substantial part of the genome. On the other hand, if the context is too short or it occurs in a repetitive region of the genome, any path you traverse is likely to be chimeric and not correspond to any part of the underlying sequence.
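A rough sketch of that idea, with heavy simplifications: instead of a real overlap graph it uses a k-mer (de Bruijn-style) graph, the reads are error-free, and both function names and the toy sequences are invented for the example. Starting from a seed, it keeps extending only while there is exactly one way forward.

```python
from collections import defaultdict

def build_graph(reads, k):
    """Link each k-mer to the k-mers that follow it (shifted by one base)
    in some read. A simplified stand-in for an overlap graph."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph

def extend_unambiguous(graph, seed):
    """Extend from seed while the current k-mer has exactly one successor."""
    path, node, seen = seed, seed, {seed}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in seen:  # stop instead of looping around a cycle forever
            break
        seen.add(node)
        path += node[-1]
    return path

# Overlapping fragments of a toy "genome" with no repeats at k=3.
unique_reads = ["ACGTT", "GTTGC", "TTGCA"]
unique_path = extend_unambiguous(build_graph(unique_reads, 3), "ACG")

# A repetitive region with a context that is too short (k=2).
repeat_reads = ["ATATAC"]
repeat_path = extend_unambiguous(build_graph(repeat_reads, 2), "AT")

print(unique_path)  # the whole toy genome is recovered
print(repeat_path)  # extension stops at the first ambiguous branch
```

In the second case the k-mer "TA" can be followed by either "AT" or "AC", so any longer path would be a guess.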

Using similar heuristics, an LLM could estimate whether it's following a long overwhelmingly likely path, replicating substantial parts of the training data, or making choices between substantially different paths, generalizing from the data. And because the training data is usually not that big, it could query the data when it believes it could be replicating the data.
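Here is a toy version of what such a check might look like, assuming an n-gram model stands in for the LLM. All names, the 0.9 threshold and the tiny corpus are hypothetical. It measures the longest run of "forced" steps (steps where the chosen token was overwhelmingly likely given the context) and then does the lookup against the training data.

```python
from collections import Counter, defaultdict

def build_counts(tokens, n):
    """Count which token follows each n-token context in the training data."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n):
        counts[tuple(tokens[i:i + n])][tokens[i + n]] += 1
    return counts

def longest_forced_run(counts, generated, n, threshold=0.9):
    """Longest streak of tokens whose estimated probability given the
    previous n tokens is at least threshold (an arbitrary cutoff)."""
    best = run = 0
    for i in range(n, len(generated)):
        successors = counts.get(tuple(generated[i - n:i]))
        if successors:
            p = successors[generated[i]] / sum(successors.values())
        else:
            p = 0.0  # unseen context: treat as not forced
        run = run + 1 if p >= threshold else 0
        best = max(best, run)
    return best

def appears_in_training(tokens, generated):
    """Naive lookup: is generated a contiguous span of the training tokens?"""
    m = len(generated)
    return any(tokens[i:i + m] == generated
               for i in range(len(tokens) - m + 1))

training = "it was the best of times it was the worst of times".split()
counts = build_counts(training, n=2)

copied = "the best of times it was".split()
spliced = "the worst of times it was the best".split()

copied_run = longest_forced_run(counts, copied, n=2)
spliced_run = longest_forced_run(counts, spliced, n=2)
copied_found = appears_in_training(training, copied)
spliced_found = appears_in_training(training, spliced)

print(copied_run, copied_found)
print(spliced_run, spliced_found)
```

On a corpus this small, the spliced sequence also has a long forced run even though it never occurs verbatim in the training text, which is why the lookup step matters: the run length only says when it is worth checking. A real system would need an index (e.g. a suffix array) rather than a linear scan.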