Hacker News
by version_five 1112 days ago
Imagine someone built a 100 trillion parameter model, Greg, who is a personal assistant. Greg has a context length of a billion tokens, and he can search the web and tell you what he's found. His memory is so good that he can quote verbatim the full text of anything he's read. It's not even compression, it's just straight up storage. Should you have to pay royalties to everyone whose content you ask Greg to look at?

What if Greg isn't an llm and he's your browser cache? Are you still infringing copyright?

4 comments

What if Greg is an online repository, crawling the web and storing and distributing copyrighted materials verbatim? Stripping out attribution? With ads and/or a paid subscription fee?

> Should you have to pay royalties to everyone whose content you ask Greg to look at?

If Greg talks so fast that he's distributing millions of these copies around the world, for money, then yes, of course he's infringing.

> What if Greg isn't an llm and he's your browser cache?

My browser cache is not a distribution mechanism. It's for my personal use. I'm not infringing on copyright if I keep books in my personal library. I am if I'm copying them millions of times and giving others access to that library for money. If I downloaded a bunch of paywalled content and then uploaded my browser cache to SomePirateSite.com, for money, then yes, I'm infringing.

Why do you think these are "gotcha" questions? This is pretty straightforward stuff, and nowhere does it prove that LLMs are not infringing.

You are definitely infringing by making a copy of a book and keeping it in your personal library.

> What if Greg is an online repository, crawling the web and storing and distributing copyrighted materials verbatim? Stripping out attribution? With ads and/or a paid subscription fee?

So you mean Google, right?

Yes. Google preventing page-clicks by showing their half-assed, confidently wrong summaries that they scraped directly from the top results? Yes, that's copyright infringement. Simply linking to the site with a short preview is not.

Your browser cache doesn’t repurpose the content it stores to create derivative works. LLMs do so by definition.

Or what if Greg is a human with eidetic memory?

Humans and software are not the same. Humans get a pass on regurgitating some stuff because our memories are fuzzy and more importantly we are not eternal, distributable entities that scale based on GPUs available.

Kim Peek is Greg and the difference between him and an AI is the text above.

Isn't this called Google search?