Hacker News
by edude03 905 days ago
I think the intent is really different.

For LLMs you're essentially teaching them language by showing them lots of examples of written language - newspapers are of course a great example of written language.

The goal of OpenAI is not to reproduce newspaper articles verbatim when asked questions (even if the answer could be a newspaper article) and the fact that it can happen is a side effect of how LLMs work.

When an HN participant shares a (paywalled) link to a NYT article, I do want to read the exact article linked, verbatim. The facts of the article may be reproduced elsewhere in a form that's free, but specific word choices might be a focal point of the discussion on HN, so I can't realistically participate in a discussion without having read the article being discussed.

And as an aside, I have no problem with paying to read news or other media. However, it's impractical for me to subscribe to every news source HN participants link to, so I gravitate to archiving services instead. I do wish there were a better solution, for example Blendle with more sources.

3 comments

> The goal of OpenAI is not to reproduce newspaper articles verbatim when asked questions (even if the answer could be a newspaper article), and the fact that it can happen is a side effect of how LLMs work.

This is an excellent point. A properly functioning LLM should not return the original content it was trained on. When one does return original content, I believe the prompt is tightly constrained and designed to extract or re-create that content. Another possible reason, which occurred to me recently, is that the training set is too small, so that even more general prompts re-create source material.

Another question would be, are LLMs regurgitating what they were trained on, or are they synthesizing something very close to the original content? (Infinite Monkeys, Shakespeare). Court cases like this increase the need for understanding the "thinking processes" in an LLM.
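
A crude way to at least put a number on "very close to the original" is to look for the longest run of consecutive words an output shares with a candidate source. The sketch below is only that: the function names are made up for illustration, it uses Python's standard `difflib`, and it measures overlap only. It cannot tell memorization apart from a synthesis that happens to land on the same words.

```python
from difflib import SequenceMatcher


def longest_shared_run(source: str, output: str) -> list[str]:
    """Return the longest run of consecutive words found in both texts.

    Hypothetical helper: comparison is case-insensitive and splits on
    whitespace only, so punctuation stays attached to words.
    """
    a = source.lower().split()
    b = output.lower().split()
    # autojunk=False so frequent words are not silently ignored on long texts.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a : match.a + match.size]


def verbatim_fraction(source: str, output: str) -> float:
    """Fraction of the output's words covered by the longest shared run."""
    words = output.split()
    if not words:
        return 0.0
    return len(longest_shared_run(source, output)) / len(words)


if __name__ == "__main__":
    article = "the quick brown fox jumps over the lazy dog"
    answer = "a quick brown fox jumps high"
    print(longest_shared_run(article, answer))
    print(verbatim_fraction(article, answer))
```

A long shared run on a general prompt would look more like regurgitation; a short one says little either way, since close paraphrase scores low on this measure.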

Maybe LLMs should follow best practices for 1980s-style backprop models and later deep learning models: starve model size to force maximum generalization and minimal memorization.

> The goal of OpenAI is not to reproduce newspaper articles verbatim when asked questions (even if the answer could be a newspaper article) and the fact that it can happen is a side effect of how LLMs work.

Seems like a nice split-the-baby resolution would be to pay NYT Corp the price of a single article read any time GPT plagiarizes more than what's allowed at an academic institution.