HN Mirror


by nimski 972 days ago

"The difference, when it comes to AI, is one of scale. ChatGPT can “read” more published words in a few seconds than I could in several lifetimes and, unlike me, that data isn’t immediately replaced in my human-limited short-term memory by whatever I’m thinking of next."

I think this misses the point. The issue of scale isn't on the ingest side, it's on the output side. Once you train an LLM on a book (however long that takes), then the LLM can be the interface to that book for an unlimited number of users. That scales very differently to, say, a person reading a book and writing something influenced by it.

In the case of the LLM, it's a complete interface to the contents of the book. It lets you "talk to the book". If that exists, why would anyone buy the book? If I could ask ChatGPT to "summarize the new book by XYZ", then spend an hour or two asking the questions _I_ have about the book from it, then buying the book would be a net negative.

If we don't solve attribution (like BMI solved for music), then the financial upside of publishing might be majority-captured by whoever trains LLMs on the copyrighted material.

5 comments

hnfong 971 days ago

By your argument, writing summaries of books, "explain <book> in 3 minutes" youtube videos, and commentaries on books should be made illegal too.

Or more precisely, they should be made illegal if and only if they achieve "scale" of maybe at least a couple million viewers.

The fundamental premise of copyright is flawed. Taking medieval concepts involving censorship of the printing press and extending them to the 21st century is bound to produce awkward results. I'm not hopeful that copyright will be reconsidered from the ground up during this AI shock, but at least we shouldn't pretend that arguments about copyright have to be reasonable or make sense. I honestly believe a "realpolitik" approach is more helpful: at least we know that those who have more political influence and spend more effort lobbying will probably "win" in the end...

nimski 971 days ago

I appreciate the example, but here's where I think it differs as an analogy of what LLMs do:

A summary doesn't have infinite or variable depth. If you read the summary of a non-fiction (I'll limit my argument to that, as another poster pointed out) book, and either aren't convinced, or want to learn more about the matter, you'd have to purchase the book.

An LLM that has been trained on the book, if somehow designed not to hallucinate, would be able to answer any question you have about the book at any depth, seamlessly blending in material from other books to answer a question or explain a concept. That seems like an entirely better experience than reading the book from start to finish. I don't see how the original can compete.

__loam 971 days ago

LLMs will never be able to not hallucinate. Also, it's insane to me that people like you would prefer to ask a chatbot about a book rather than read the book itself. Part of the value of books is the voice of an author.

passion__desire 971 days ago

This is a very basic, naive scenario. Words are in the public domain, but somehow their arrangement makes all the difference. Can AI just solve this "arrangement" problem better than humans do? Arrangements can be likened to a series of moves in chess, and AlphaZero solves for this through self-play given only the rules.

chrisdbanks 971 days ago

Assuming you know the right questions to ask. Most people don't know what they don't know. I've tried this. I'd prefer to pay a small amount to read the book.

nimski 971 days ago

Is that really that much of a barrier? Off the top of my head: you could start with a prompt like "write a summary of book XYZ, followed by a summary of each chapter". Then dive deeper into each one from there using the same prompt recursively, etc.
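To make the recursive idea concrete, here is a rough sketch. Everything in it is an assumption for illustration: `ask` stands in for whatever chat model you call, the prompt wording is invented, and the reply format (first line is the summary, each later line is "Part title: one-line summary") is something you would have to coax the model into producing.

```python
from typing import Callable, Dict

# Hypothetical prompt template; the exact wording is an assumption.
PROMPT = "Write a summary of {topic}, followed by a one-line summary of each of its parts."


def explore(ask: Callable[[str], str], topic: str, depth: int) -> Dict:
    """Summarize `topic`, then repeat the same prompt on each part, `depth` levels down."""
    reply = ask(PROMPT.format(topic=topic))
    lines = [ln.strip() for ln in reply.splitlines() if ln.strip()]
    summary, parts = (lines[0], lines[1:]) if lines else ("", [])
    node = {"topic": topic, "summary": summary, "children": []}
    if depth > 0:
        for part in parts:
            # Assumed reply format: "Part title: one-line summary"
            title = part.split(":", 1)[0].strip()
            node["children"].append(explore(ask, f"{title} of {topic}", depth - 1))
    return node


def fake_ask(prompt: str) -> str:
    """Canned stand-in for a real chat model, so the sketch runs offline."""
    if "Chapter" in prompt:
        return "A chapter summary."
    return "A book summary.\nChapter 1: setup\nChapter 2: payoff"


outline = explore(fake_ask, "book XYZ", depth=1)
```

Swapping `fake_ask` for a real model call is the only change needed to try it, though nothing here guards against the model inventing chapters that do not exist.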

bugglebeetle 971 days ago

Yes, it’s a tremendous barrier, which is why academic fields tend to have introductions or surveys of the domain, stepwise instruction, with more in-depth or specialist knowledge being premised on having this understanding. I feel like this should be obvious to everyone? Did your education not follow such a progression?

WalterBright 971 days ago

Summaries are fair use.

hnfong 971 days ago

> if somehow designed not to hallucinate

We know this isn't possible at the moment. Are we going to legislate for something that is not yet technologically possible? Should judges decide cases because maybe ML researchers will figure out how to reliably stop models from hallucinating?

pjfin123 971 days ago

Under U.S. copyright law you can't copyright facts, only the artistic expression.

Zambyte 971 days ago

Unless we're talking about source code of course.

jkaplowitz 971 days ago

Even for source code, the US does not offer copyright to programs that are simple enough to have effectively one way to accomplish the desired function rather than requiring creative (aka artistic) choices by the programmer.

Example program specification for which the straightforward implementation in any common programming language would not be copyrightable by itself without adding additional scope: “When executed, output ‘Hello, world!’ plus a new line character to standard output, and then exit returning exit code 0.”

Zambyte 971 days ago

> Even for source code, the US does not offer copyright to programs that are simple enough to have effectively one way to accomplish the desired function

I think the fact you use the word "function" here is extremely telling. Writing code is obviously in a closer intellectual domain to designing a car engine, than it is to drawing a picture.

Maybe you have ground to stand on when talking about things like code golf which could be analogous to poetry. But no, the vast majority of code is not the product of artistic expression. It is the product of functional desires.

Not sure why you're trying to make an argument about trivial software. The same is true about trivial art: draw a black square on a white canvas. Good luck claiming copyright for that.

jkaplowitz 971 days ago

I disagree that the vast majority of code lacks artistic expression, especially when using the inclusive sense of the word “artistic” (or often “creative”) that the law uses to determine copyrightability.

There are so many different styles and designs when implementing any nontrivial underlying functional specification, and the preferences, choices, skill, and aesthetic of individual programmers definitely shine through. The way you would find it straightforward to write a given program and the way I would write the same program are very probably recognizably different, once you get beyond purely functional programs like the one I gave.

The existence of an underlying functional desire does not change the necessary artistic element in how to achieve that desire. Even in the traditional art world, an underlying functional desire is often more present than you think. Many artworks throughout history and even today are in fact commissioned, whether explicitly per-piece or through a patronage or employment relationship. A commissioned artwork is trying to satisfy either the specifications or the desires of the client. And among those which aren’t commissioned, like personal photographs, the underlying desire is often a functional one of remembering an occasion, despite the many clearly copyrightable artistic choices and skill required to create the work.

The black square on a white canvas example could very well be copyrightable, and I’d even guess that it usually is. Your functional specification still leaves the artist much freedom to choose the dimensions, relative positions and angles, exact shades of color, and materials of both the black square and the white canvas, as well as the shape of the canvas. Many ways to do it - and, importantly, no obvious one straightforward way to do it as there is in my trivial programming example.

drdaeman 971 days ago

> If we don't solve attribution

It can't be solved, by design. We want LLMs to behave naturally. Humans, naturally, don't provide any attribution, unless it really matters for the conversation.

No one (except for the copyright holders) wants LLMs to be a marketing department's dream, something straight out of cyberpunk novels, spewing brand names(tm) non-stop.

> then buying the book would be a net negative

Surely this is not true. At least for fiction, people read books instead of their short summaries because they want to spend time enjoying the story. That's why people are so against spoilers.

> It lets you "talk to the book". If that exists, why would anyone buy the book?

Interactive and non-interactive experiences are two different things. Although, for sure, after a good book, I'd surely enjoy a "what-if" or "explain that" chat with an LLM (here, a possible business model for rightholders). But a chat cannot replace a story.

For non-fiction, I might enjoy a brief summary first. That's why science papers start with an abstract, anticipating the reader's needs. But even then, if I'm interested, I will probably need the full unabridged text to get into the exact details (without LLMs hallucinating me anything).

bugglebeetle 971 days ago

I’m confused why you claim attribution is somehow “unnatural”? Every actually useful lecture, essay, report, etc. I’ve encountered included things like footnotes, references, or a bibliography. So much so, in fact, that I tend to disregard things that don’t include them. So-and-so claims X. What are their sources? There are none? Who cares. Life is too short to engage with arguments that lack rigor or support, even though these things themselves require verification!

MacsHeadroom 971 days ago

Life is too short for me to engage with your argument, because you've failed to attribute the first writers of sentences / ideas semantically similar to each of the lines in your comment.

bugglebeetle 971 days ago

Ah, you’re right. My mistake. I should’ve simply claimed it’s natural to cite sources instead. After all, there is no debating what is natural or those who are simple.

drdaeman 971 days ago

My apologies, my perception of LLMs is somewhat skewed, because I primarily think of conversation agents.

It's unnatural in a conversation. When we're talking about, say, Superman, we don't ever say that it's "a registered trademark of DC Comics, Inc." With obligatory exceptions for comical or satirical effects, or if we're specifically talking about trademarks or copyrights, etc. And of course when we're talking about robots we don't normally give any nods to Karel Čapek.

I believe that, same as humans, LLMs already try to provide references when requested, or if the style/format (such as a lecture) calls for them. Just remember that famous anecdote where a lawyer used ChatGPT and it wrote a speech and provided believable references (then the judge threw it out of court, because quality/reliability is another problem, which is out of scope here).

nimski 971 days ago

You're right. I think it's fair to carve out fiction from my argument. For that, I would surely go to the source material until the point where the LLM was coming up with better long-form fiction de-novo. But for non-fiction, which I would guess is the economically and intellectually more important category to protect, the effects may be devastating.

I also agree that attribution can't be solved easily in the current paradigm. Perhaps, during training, one could deduce how much of the net gradient on a particular weight was derived from the batches covering some book, and then during inference, assign attribution based on the effect of that weight on the output. All of this is very expensive to do, and I don't have strong intuitions for whether the resulting attributions would be in any way meaningful.
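A toy version of the gradient-ledger idea, as a sketch only. It assumes a linear model with zero-initialized weights trained by plain SGD, which is the one case where the bookkeeping is exact: every bit of each weight can be traced to the batch that moved it, so the per-source shares add up to the model's output. The names `train_with_ledger` and `attribute`, and the "book_a"/"book_b" data, are made up for illustration. For a deep network the sources interact nonlinearly, so the same ledger would be an approximation at best.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Example = Tuple[List[float], float]  # (features, target)


def train_with_ledger(
    batches: List[Tuple[str, List[Example]]],
    n_features: int,
    lr: float = 0.1,
    epochs: int = 50,
) -> Tuple[List[float], Dict[str, List[float]]]:
    """SGD on a linear model, recording which source moved each weight and by how much."""
    weights = [0.0] * n_features  # zero init: all weight mass comes from some batch
    ledger: Dict[str, List[float]] = defaultdict(lambda: [0.0] * n_features)
    for _ in range(epochs):
        for source, examples in batches:
            for x, y in examples:
                error = sum(w * xi for w, xi in zip(weights, x)) - y
                for j, xj in enumerate(x):
                    step = -lr * error * xj  # gradient step for squared error
                    weights[j] += step
                    ledger[source][j] += step  # credit this source for the update
    return weights, dict(ledger)


def attribute(ledger: Dict[str, List[float]], x: List[float]) -> Dict[str, float]:
    """Split the model's output on `x` into per-source shares."""
    return {
        source: sum(d * xi for d, xi in zip(deltas, x))
        for source, deltas in ledger.items()
    }


# Invented sample data: each "book" teaches the model about one feature.
batches = [
    ("book_a", [([1.0, 0.0], 2.0)]),
    ("book_b", [([0.0, 1.0], -1.0)]),
]
weights, ledger = train_with_ledger(batches, n_features=2)
x = [1.0, 1.0]
prediction = sum(w * xi for w, xi in zip(weights, x))
shares = attribute(ledger, x)
```

Even in this toy, the ledger costs one extra copy of the weights per source, which hints at why doing this for a large model over millions of books would be so expensive.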

To your point about hallucinations, if there's not a solution to that, then perhaps the whole point is moot when, after a while, the hype dies down. But if somehow hallucinations are solved (I don't see a technical way this can happen now, but who knows?), then I think we'll need to address attribution for non-technical material.

TJSomething 971 days ago

My impression is that attribution on limited datasets isn't terribly hard. If you can prompt the LLM to say a sentence that is approximately in the source material, then the nearest sentence vector in the source material can be looked up in a vector DB, which can attribute it in context.
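Roughly what I mean, as a minimal sketch. The assumptions are loud ones: word-count vectors stand in for a real sentence-embedding model, a linear scan over a Python list stands in for the vector DB, and the corpus sentences and source labels are invented sample data.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(sentence: str) -> Counter:
    """Toy sentence vector: lowercase word counts (stand-in for a learned embedding)."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def attribute_sentence(generated: str, corpus: List[Tuple[str, str]]) -> Tuple[str, str, float]:
    """Return (source, sentence, similarity) for the corpus sentence nearest to `generated`."""
    query = embed(generated)
    source, sentence = max(corpus, key=lambda item: cosine(query, embed(item[1])))
    return source, sentence, cosine(query, embed(sentence))


# Invented sample corpus of (source label, sentence) pairs.
corpus = [
    ("Book A, ch. 1", "Copyright protects expression, not facts."),
    ("Book A, ch. 2", "Summaries are often treated as fair use."),
    ("Book B, ch. 5", "Vector databases index sentence embeddings for nearest neighbour search."),
]
source, sentence, score = attribute_sentence(
    "Facts are not protected by copyright, only expression is.", corpus
)
```

In practice you would also want a similarity threshold below which no attribution is shown, since the nearest sentence always exists even when it is a poor match.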

I think this might be one of the few places where LLMs can provide straightforward value, since it can work as a search engine that can accept vague queries, create approximate answers, fetch the real answers, translate the source material into layman's terms with citations, and allow the newly informed user to refine or dig deeper with that context. The most dangerous part is translation, and the data I've seen show that transformers almost never hallucinate on tasks where no external knowledge is needed.

WalterBright 971 days ago

> without LLMs hallucinating me anything

That's the trouble with LLMs. You cannot rely on what it is regurgitating.

dbtc 971 days ago

> If that exists, why would anyone buy the book?

Because (if the book affords it), reading can be a form of psychic traveling. A reader enters an altered state of consciousness, lives in the world of the book, and comes back changed.

A summary of the information and 'plot points' would seem like a replacement only for those who have never really been absorbed in reading a book.

two_in_one 971 days ago

> why would anyone buy the book? If I could ask ChatGPT to "summarize

A summary is not the same as reading the book. Anyone can read reviews, human-written summaries, or even some analyses on the internet instead of asking an AI model or buying the book. An AI model usually cannot reproduce even small fragments. But it can indefinitely 'creatively' fantasize in the book's universe, usually muddling the facts and mixing it all with other books.

By the way, the model doesn't have to be as big as ChatGPT. Anyone with a good gaming GPU can get a free open-source model and train it for academic research. Mixing fantasy with something else can produce interesting results.

Alifatisk 971 days ago

> If I could ask ChatGPT to "summarize the new book by XYZ", then spend an hour or two asking the questions _I_ have about the book from it, then buying the book would be a net negative.

I believe you lose some data when doing so. Summarizing is good when you want to get the gist of it, but not when you want the actual details.

I know this sounds very obvious, but some people seriously jump directly to a summary and believe that is enough when they research.