Hacker News
by nimski 971 days ago
You're right. I think it's fair to carve fiction out of my argument. For that, I would surely go to the source material, at least until LLMs start coming up with better long-form fiction de novo. But for non-fiction, which I would guess is the economically and intellectually more important category to protect, the effects may be devastating.

I also agree that attribution can't be solved easily in the current paradigm. Perhaps, during training, one could deduce how much of the net gradient on a particular weight was derived from the batches covering some book, and then during inference, assign attribution based on the effect of that weight on the output. All of this is very expensive to do, and I don't have strong intuitions for whether the resulting attributions would be in any way meaningful.
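To make the idea concrete, here is a toy sketch of one way to read it, on a two-weight linear model rather than an LLM. Everything in it is an assumption for illustration: the names `train` and `attribute_output`, the choice to credit each source with the absolute size of its SGD updates, and the choice to measure a weight's effect on the output as |w_i * x_i|. It is not an established attribution method, and it says nothing about whether the result would be meaningful at scale.

```python
from collections import defaultdict

def train(batches, n_weights, lr=0.1, epochs=50):
    """Fit y ~ w.x with SGD, logging how much each source moved each weight.

    batches: list of (source_label, x, y) tuples (hypothetical format).
    Returns (weights, credit), where credit[i][source] is the summed
    absolute update that batches from `source` applied to weight i.
    """
    w = [0.0] * n_weights
    credit = [defaultdict(float) for _ in range(n_weights)]
    for _ in range(epochs):
        for source, x, y in batches:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for i, xi in enumerate(x):
                step = -lr * err * xi  # SGD step for loss 0.5 * err**2
                w[i] += step
                credit[i][source] += abs(step)
    return w, credit

def attribute_output(w, credit, x):
    """Split one prediction's credit across sources.

    Assumption: a weight's effect on the output is |w_i * x_i|, and that
    effect is divided among sources by their share of updates to w_i.
    """
    scores = defaultdict(float)
    for wi, xi, per_source in zip(w, x, credit):
        total = sum(per_source.values())
        if total == 0:
            continue
        influence = abs(wi * xi)
        for source, amount in per_source.items():
            scores[source] += influence * amount / total
    norm = sum(scores.values())
    return {s: v / norm for s, v in scores.items()} if norm else {}

# Two made-up "books", each of which only exercises one input feature.
batches = [
    ("book_a", [1.0, 0.0], 2.0),
    ("book_b", [0.0, 1.0], -3.0),
]
weights, credit = train(batches, n_weights=2)
shares = attribute_output(weights, credit, [1.0, 0.0])
```

Here an input that only touches the first feature gets attributed almost entirely to `book_a`, because only that book's batches ever moved the first weight. In a real network nearly every batch moves nearly every weight, and the per-weight, per-source bookkeeping is where the expense comes from.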

To your point about hallucinations: if there's no solution to that, then perhaps the whole point becomes moot once the hype dies down. But if hallucinations are somehow solved (I don't see a technical way this can happen now, but who knows?), then I think we'll need to address attribution for non-technical material.

1 comment

My impression is that attribution on limited datasets isn't terribly hard. If you can prompt the LLM to say a sentence that is approximately in the source material, then the nearest sentence vector in the source material can be looked up in a vector DB, which can attribute it in context.
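A minimal sketch of that lookup, with loud assumptions: a bag-of-words count vector stands in for a real sentence embedding, a plain Python list stands in for the vector DB, and the corpus, the labels, and the names `embed`, `cosine`, and `attribute_sentence` are all made up for illustration.

```python
import math
import re
from collections import Counter

def embed(sentence):
    """Toy stand-in for a sentence embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical source material: (citation label, sentence) pairs.
SOURCES = [
    ("Book A, ch. 1", "Mitochondria produce most of the cell's chemical energy."),
    ("Book A, ch. 2", "Ribosomes assemble proteins from amino acids."),
    ("Book B, ch. 5", "The French Revolution began in 1789."),
]

# The "vector DB": every source sentence stored with its vector.
INDEX = [(label, text, embed(text)) for label, text in SOURCES]

def attribute_sentence(generated, min_score=0.3):
    """Return (label, source sentence, score) for the nearest source
    sentence, or None if nothing is similar enough to cite."""
    query = embed(generated)
    label, text, score = max(
        ((l, t, cosine(query, v)) for l, t, v in INDEX),
        key=lambda hit: hit[2],
    )
    return (label, text, score) if score >= min_score else None

hit = attribute_sentence(
    "Most of a cell's chemical energy is produced by mitochondria."
)
miss = attribute_sentence("Bananas are yellow.")
```

The `min_score` cutoff (an arbitrary value here) is what keeps it from citing a source for a sentence that isn't actually close to anything in the material. A real setup would swap in a learned embedding model and an approximate nearest-neighbor index, but the shape of the lookup stays the same.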

I think this might be one of the few places where LLMs can provide straightforward value, since they can work as a search engine that accepts vague queries, creates approximate answers, fetches the real answers, translates the source material into layman's terms with citations, and lets the newly informed user refine or dig deeper with that context. The most dangerous part is translation, and the data I've seen show that transformers almost never hallucinate on tasks where no external knowledge is needed.