by drdaeman 972 days ago
> If we don't solve attribution

It can't be solved, by design. We want LLMs to behave naturally. Humans, naturally, don't provide any attribution, unless it really matters for the conversation.

No one (except for the copyright holders) wants LLMs to be a marketing department's dream, something straight out of cyberpunk novels, spewing brand names(tm) non-stop.

> then buying the book would be a net negative

Surely this is not true. At least for fiction, people read books rather than short summaries because they want to spend time enjoying the story. That's why people are so against spoilers.

> It lets you "talk to the book". If that exists, why would anyone buy the book?

Interactive and non-interactive experiences are two different things. After a good book, I'd surely enjoy a "what-if" or "explain that" chat with an LLM (a possible business model for rightsholders, by the way). But a chat cannot replace a story.

For non-fiction, I'd probably enjoy a brief summary first. That's why scientific papers start with an abstract, anticipating the reader's needs. But even then, if I'm interested, I will probably need the full unabridged text to get into the exact details (without LLMs hallucinating me anything).

3 comments

I’m confused why you claim attribution is somehow “unnatural”? Every actually useful lecture, essay, report, etc. I’ve encountered included things like footnotes, references, or a bibliography. So much so, in fact, that I tend to disregard things that don’t include them. So-and-so claims X. What are their sources? There are none? Who cares. Life is too short to engage with arguments that lack rigor or support, even though these things themselves require verification!

Life is too short for me to engage with your argument, because you've failed to attribute the first writers of sentences / ideas semantically similar to each of the lines in your comment.

Ah, you’re right. My mistake. I should’ve simply claimed it’s natural to cite sources instead. After all, there is no debating what is natural or those who are simple.

My apologies, my perception of LLMs is somewhat skewed, because I primarily think of conversation agents.

It's unnatural in a conversation. When we're talking about, say, Superman, we don't ever say that it's "a registered trademark of DC Comics, Inc." With obligatory exceptions for comical or satirical effects, or if we're specifically talking about trademarks or copyrights, etc. And of course when we're talking about robots we don't normally give any nods to Karel Čapek.

I believe that, same as humans, LLMs already try to provide references when requested, or when the style/format (such as a lecture) calls for them. Just remember that famous anecdote where a lawyer used ChatGPT and it wrote a speech and provided believable references (then the judge threw it out of court because quality/reliability is another problem, which is out of scope here, though).

You're right. I think it's fair to carve out fiction from my argument. For that, I would surely go to the source material, at least until LLMs start coming up with better long-form fiction de novo. But for non-fiction, which I would guess is the economically and intellectually more important category to protect, the effects may be devastating.

I also agree that attribution can't be solved easily in the current paradigm. Perhaps, during training, one could deduce how much of the net gradient on a particular weight was derived from the batches covering some book, and then during inference, assign attribution based on the effect of that weight on the output. All of this is very expensive to do, and I don't have strong intuitions for whether the resulting attributions would be in any way meaningful.
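
To make the idea concrete, here is a toy sketch of that gradient bookkeeping, not a claim about how it would work on a real LLM. Everything in it is an assumption for illustration: a two-weight linear model trained with plain SGD, hypothetical names `train_with_ledger` and `attribute`, absolute gradient size as the per-source "credit", and `|weight * input|` as the weight's effect on the output.

```python
from collections import defaultdict


def train_with_ledger(examples, n_features, lr=0.1, epochs=50):
    """Train a linear model with SGD on squared error.

    `examples` is a list of (source, inputs, target) tuples. Alongside the
    weights we keep a ledger: for every weight, how much absolute gradient
    each source contributed to it (an assumed proxy for "credit").
    """
    weights = [0.0] * n_features
    ledger = [defaultdict(float) for _ in range(n_features)]
    for _ in range(epochs):
        for source, x, y in examples:
            pred = sum(w * xi for w, xi in zip(weights, x))
            err = pred - y
            for i, xi in enumerate(x):
                grad = err * xi  # d(0.5 * err^2) / d(w_i)
                weights[i] -= lr * grad
                ledger[i][source] += abs(grad)
    return weights, ledger


def attribute(weights, ledger, x):
    """Split credit for one output across sources.

    Each weight's influence on the output (|w_i * x_i|) is divided among
    sources in proportion to their share of that weight's ledger.
    """
    scores = defaultdict(float)
    for w, xi, per_source in zip(weights, x, ledger):
        influence = abs(w * xi)
        total = sum(per_source.values())
        if influence == 0.0 or total == 0.0:
            continue
        for source, mass in per_source.items():
            scores[source] += influence * mass / total
    norm = sum(scores.values())
    return {s: v / norm for s, v in scores.items()} if norm else {}


# Hypothetical data: "book_a" only ever touches feature 0, "book_b" feature 1.
examples = [
    ("book_a", [1.0, 0.0], 2.0),
    ("book_b", [0.0, 1.0], -1.0),
]
weights, ledger = train_with_ledger(examples, n_features=2)
only_a = attribute(weights, ledger, [1.0, 0.0])
mixed = attribute(weights, ledger, [1.0, 1.0])
print(only_a)
print(mixed)
```

Even in this toy the ledger is one counter per weight per source, which hints at why it gets expensive at scale. And it only separates the books cleanly because they touch disjoint weights, so it says nothing about whether the attributions would be meaningful in a real network.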

To your point about hallucinations, if there's not a solution to that, then perhaps the whole point is moot when, after a while, the hype dies down. But if somehow hallucinations are solved (I don't see a technical way this can happen now, but who knows?), then I think we'll need to address attribution for non-technical material.

My impression is that attribution on limited datasets isn't terribly hard. If you can prompt the LLM to say a sentence that is approximately in the source material, then the nearest sentence vector in the source material can be looked up in a vector DB, which can attribute it in context.
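
A minimal sketch of that lookup, under loud assumptions: the bag-of-words `embed` function stands in for a real sentence-embedding model, the in-memory `SourceIndex` stands in for a vector DB, and the source labels, sentences, and `min_score` threshold are all made up for illustration.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SourceIndex:
    """In-memory stand-in for a vector DB over source sentences."""

    def __init__(self):
        self.entries = []  # (source label, sentence, vector)

    def add(self, source, sentence):
        self.entries.append((source, sentence, embed(sentence)))

    def attribute(self, generated, min_score=0.5):
        """Return (source, sentence, score) for the nearest source sentence,
        or None if nothing is similar enough to count as an attribution."""
        query = embed(generated)
        best, best_score = None, 0.0
        for source, sentence, vector in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best, best_score = (source, sentence), score
        if best is None or best_score < min_score:
            return None
        return best[0], best[1], best_score


index = SourceIndex()
index.add("Book A, ch. 1", "The word robot was introduced by Karel Čapek in the play R.U.R.")
index.add("Book B, ch. 3", "Abstracts summarize a paper so readers can decide whether to read it.")

hit = index.attribute("Karel Čapek introduced the word robot in his play R.U.R.")
miss = index.attribute("Completely unrelated text about gardening.")
print(hit)
print(miss)
```

The threshold matters: without it, every generated sentence gets "attributed" to whatever happens to be nearest, however far away that is.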

I think this might be one of the few places where LLMs can provide straightforward value, since they can work as a search engine that accepts vague queries, creates approximate answers, fetches the real answers, translates the source material into layman's terms with citations, and lets the newly informed user refine or dig deeper with that context. The most dangerous part is translation, and the data I've seen show that transformers almost never hallucinate on tasks where no external knowledge is needed.

> without LLMs hallucinating me anything

That's the trouble with LLMs. You cannot rely on what they regurgitate.