I don't have good answers. I have some high-level intuitions.
One of them is that creation costs of information are fixed, while its usefulness is unbounded, so it doesn't make sense to try and reward creators for each access/view/use, in perpetuity.
Secondly, there's a lot of information laundering going on - any random book I read carries anywhere from a few to a few hundred references to prior written work. What I pay for the book goes to the author and the publishers, but AFAIK it doesn't go to any of the authors and publishers of works referenced in the book. Wikipedia takes this one step further, effectively turning all that information free.
Thirdly, AFAIK copyright explicitly does not cover information/knowledge - it covers specific works. So Google showing me an info box with a recipe scraped from some site could technically fall afoul of the law - but an LLM generating me a recipe based on associations created from being trained on millions of recipes, this feels like it should be in the clear, at least from the user's POV.
I think that is a somewhat narrow view. Maybe to make the contrast sharper: Why should I contribute any information just so that it immediately gets monetized by a handful of LLM firms?
The new situation isn't the same as search as that wasn't there to hide information sources or to immediately convert information into useful things (texts, guides, etc.).
> Why should I contribute any information just so that it immediately gets monetized by a handful of LLM firms?
If this matters to you, then you shouldn't. But to flip this around: why should you care?
Unless you're doing some unique work targeting a global audience, the point when an LLM gets trained on what you created is way outside the space you'd normally care about. Trying to capture all the value your work generates does not lead to a good world.
Or maybe it's me who isn't profit-minded enough, but e.g. a lot of what I wrote on-line, including blog articles and commentary on Reddit and HN, has been used by search engines for free for a long time (over a decade, in some cases), and now is (most likely) part of the training corpora for LLMs. But I never believed, and still don't believe, that I'm entitled to some share of the gains LLMs (or search engines) make.
Perhaps there will be a drop in high value information in the public domain, but right now, I can't exactly see LLMs impacting the incentives for creation and sharing of that information. I don't see how LLMs would make someone go "oh well, AI is here, I might as well stop providing people with no-strings-attached high quality information", if the existence of search engines didn't make them stop already.
For years people have been making travel blogs based on where they've visited and the practical information they've discovered, like experiences of visiting attractions or good places to stay in cities or how they got from one place to another. They monetised with ads and affiliate links so they could travel more based on that income.
In LLM land, they get no monetisation any more because nobody visits their sites, instead the LLM just regurgitates the answers they found.
The search engines actively supported these authors, by sending them people who needed the answers they had.
So in LLM land this information goes away because the feedback loop of the traveller creating information which earns them money to continue travelling goes away.
A LOT of the useful information on the web was built on similar feedback loops and they go away in LLM land.
Then perhaps they should find other means of livelihood, instead of preventing the rest of the world from making full use of the information and technology available to it.
They will, and less information will be put into places that are freely accessible. If it's put anywhere at all, it'll be put behind login-only/paywalled/unscrapable places that LLMs can't access.
Ideas are copied by reading or hearing them. You can't own your ideas now, unless by own you mean hoard. The perpetual creators' rights you want extended are already artificial and require a non-trivial amount of our GDP to enforce, and they still stifle future creation in a lot of areas.
Most people are paid for doing things every day, they don't get to create one thing and never work again. Expanding creators' compensation laws is regressive and only helps a few elites survive job uncertainty, not the bulk of the people. We're better off limiting this sort of thing specifically to help everyone advance and share the knowledge.
> One of them is that creation costs of information are fixed, while its usefulness is unbounded, so it doesn't make sense to try and reward creators for each access/view/use, in perpetuity.
The word "creation" is loaded. No one "creates" content. They discover it hidden in some idea-space... occasionally even two people might discover the same thing. The same melody, the same verse of a poem, the same fragment of art.
The idea that one should be rewarded while the other is slandered as the infringer is amusingly dumb.
> but an LLM generating me a recipe based on associations created from being trained on millions of recipes, this feels like it should be in the clear, at least from user's POV
There is a big potential role for open source or more specifically copyleft / free AI here that is released as a community project but can be monetized as well. The evidence from software is that there is lots of interest in contributing to such products.
I'm organizing the publishing of my thoughts on technology development and design. This has been something I have been mulling for decades. Completely uncompensated. I'm not doing any of this for reasons I can understand; it is just what I think about and do.
Originally I saw a website as a way to hang out a shingle. Until recently I was thinking I could just publish away and maybe someone would hire me based off the website.
Currently I don't feel the same way regarding publishing on the web. I will be more guarded in what I share.
> Even if it isn't monetizeable IP, how to share the costs?
The internet started thanks to ample government funding for research. So have many other technologies, including AI.
I wonder if there's a way we could all somehow pool our resources and use that to pay for common goods that we all use. What would we call such a scheme?
> I wonder if there's a way we could all somehow pool our resources and use that to pay for common goods that we all use. What would we call such a scheme?
Is this tongue in cheek? I think it's called the government and taxes! :)
It isn’t clear to me (other than it’s an open engineering problem) why LLMs couldn’t also include attribution as part of training. Also, tracking attribution could lead to some insights on how their internal representations in vector space are created.
That is true. I see no obvious reason why the LLM companies take pride in not being able to document the ideation process. I have no justification, but I feel the reasoning is deceitful rather than technical.
The issue here is that memorization of any distinguishable part of IP is an incidental aspect - those models aren't memorizing stuff, they're learning it. We don't expect people to keep track of the source of every single piece of information they encounter. It would arguably make learning impossible - as much for humans as for LLMs.
As an intuition pump, when I write "2+2 = " and you mentally complete it with "4", should I chastise you for not completing it with "4, as per ${your elementary class math textbook} and ${that other book you read as a kid}, corroborated by ${your first math teacher} and ${your parent} quoting ${some other work}"?
When you make an omelette, what is the technical barrier making it practically impossible to tell which egg contributed how much to any given part of the meal?
Phind’s base model, which is GPT3.5/4, doesn’t itself do attribution; it’s made to do that with prompt engineering, which provides the most relevant materials on the web based on a word embedding vector search, and then asks it to reference each source in the answer.
I mean, this is more or less what a student does when writing a paper, when they're forced to cite their sources. They first come up with an idea based on their own understanding/recollection, then they try to figure out where they first took that idea from. If they remember a specific source, they'll cite that; if they don't (because there may not be one specific source they learned from), they'll search for some existing work that expresses the idea in question, and cite that.
I.e. in case of both the student and an LLM, correct citation doesn't actually mean the idea originates from the cited work - only that the work contains this idea.
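For anyone curious what that retrieve-then-cite flow looks like mechanically, here is a minimal sketch. Everything in it is an assumption for illustration: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the `DOCUMENTS` list and its example.com URLs are made up, and no actual LLM is called. It is not how Phind is implemented, just the general shape of the idea. Note that the sources get picked because they contain matching text, not because the answer originated from them.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a real embedding model: bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, documents, k=2):
    """Return the k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(documents,
                    key=lambda d: cosine(q, embed(d["text"])),
                    reverse=True)
    return ranked[:k]

def build_prompt(question, sources):
    """Number the retrieved sources and ask the model to cite them."""
    lines = ["Answer using only the sources below and cite them as [n]."]
    for i, s in enumerate(sources, 1):
        lines.append(f"[{i}] {s['url']}: {s['text']}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)

# Hypothetical corpus; the URLs and texts are invented for this example.
DOCUMENTS = [
    {"url": "https://example.com/omelette",
     "text": "whisk three eggs and cook the omelette in butter"},
    {"url": "https://example.com/train",
     "text": "take the night train from the city to the coast"},
    {"url": "https://example.com/eggs",
     "text": "fresh eggs make a better omelette"},
]

question = "how do I cook an omelette with eggs"
sources = retrieve(question, DOCUMENTS)
prompt = build_prompt(question, sources)
print(prompt)  # this text would be sent to the model in place of the bare question
```

The citation step lives entirely in the prompt: the model is only asked to point at whichever retrieved text supports each claim.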
Thank you, I want to believe responsible development is happening. I just asked an LLM my first question and the interactive processing was great to watch.