by RandomLensman 1064 days ago
And the cost of creating the information you want should be borne by whom? Even if it isn't monetizeable IP, how to share the costs?

I don't have good answers. I have some high-level intuitions.

One of them is that creation costs of information are fixed, while its usefulness is unbounded, so it doesn't make sense to try and reward creators for each access/view/use, in perpetuity.

Secondly, there's a lot of information laundering going on - any random book I read carries anywhere from a few to a few hundred references to prior written work. What I pay for the book goes to the author and the publishers, but AFAIK it doesn't go to any of the authors and publishers of works referenced in the book. Wikipedia takes this one step further, effectively turning all that information free.

Thirdly, AFAIK copyright explicitly does not cover information/knowledge - it covers specific works. So Google showing me an info box with a recipe scraped from some site could technically fall afoul of the law - but an LLM generating me a recipe based on associations created from being trained on millions of recipes, this feels like it should be in the clear, at least from the user's POV.

I think that is a somewhat narrow view. Maybe to make the contrast sharper: Why should I contribute any information just so that it immediately gets monetized by a handful of LLM firms?

The new situation isn't the same as search as that wasn't there to hide information sources or to immediately convert information into useful things (texts, guides, etc.).

> Why should I contribute any information just so that it immediately gets monetized by a handful of LLM firms?

If this matters to you, then you shouldn't. But to flip this around: why should you care?

Unless you're doing some unique work targeting a global audience, the point at which an LLM gets trained on what you created is way outside the space you'd normally care about. Trying to capture all the value your work generates does not lead to a good world.

Or maybe it's me who isn't profit-minded enough, but e.g. a lot of what I wrote on-line, including blog articles and commentary on Reddit and HN, has been used by search engines for free for a long time (over a decade, in some cases), and now is (most likely) part of the training corpora for LLMs. But I never believed, and still don't believe, that I'm entitled to some share of the gains LLMs (or search engines) make.

This isn't so much about compensation, but why should I help enrich a large, even more direct rent seeker?

Valuable information in a way is becoming more valuable for the LLM provider, so I would expect a drop in high value information in the public domain.

Perhaps there will be a drop in high value information in the public domain, but right now, I can't exactly see LLMs impacting the incentives for creation and sharing of that information. I don't see how LLMs would make someone go "oh well, AI is here, I might as well stop providing people with no-strings-attached high quality information", if the existence of search engines didn't make them stop already.
For years people have been making travel blogs based on where they've visited and the practical information they've discovered, like experiences of visiting attractions or good places to stay in cities or how they got from one place to another. They monetised with ads and affiliate links so they could travel more based on that income.

In LLM land, they get no monetisation any more because nobody visits their sites, instead the LLM just regurgitates the answers they found.

The search engines actively supported these authors, by sending them people who needed the answers they had.

So in LLM land this information goes away because the feedback loop of the traveller creating information which earns them money to continue travelling goes away.

A LOT of the useful information on the web was built on similar feedback loops and they go away in LLM land.

A lot of people justifiably care because making that information is their livelihood. Your entitlement to the labor of others is gross.
Then perhaps they should find other means of livelihood, instead of preventing the rest of the world from making full use of the information and technology available to it.
They will, and less information will be put into places that are freely accessible. If it's put anywhere at all, it'll be put behind login-only/paywalled/unscrapable places that LLMs can't access.
Ideas are copied by reading or hearing them. You can't own your ideas now, unless by own you mean hoard. The perpetual creators' rights you want extended are already artificial and require a non-trivial amount of our GDP to enforce, and they still stifle future creation in a lot of areas.

Most people are paid for doing things every day; they don't get to create one thing and never work again. Expanding creators' compensation laws is regressive and only helps a few elites survive job uncertainty, not the bulk of the people. We're better off limiting this sort of thing specifically to help everyone advance and share the knowledge.

I think that is somewhat off topic. I don't see why rent seeking via IP laws should be bad while doing it via provision of AI wouldn't be.
Your entitlement to the labor of others is gross.
> One of them is that creation costs of information are fixed, while its usefulness is unbounded, so it doesn't make sense to try and reward creators for each access/view/use, in perpetuity.

The word "creation" is loaded. No one "creates" content. They discover it hidden in some idea-space... occasionally even two people might discover the same thing. The same melody, the same verse of a poem, the same fragment of art.

The idea that one should be rewarded, but the other is slandered as the infringer, is amusingly dumb.

> but an LLM generating me a recipe based on associations created from being trained on millions of recipes, this feels like it should be in the clear, at least from user's POV

But how will we entrench the rent-seekers?

The biggest reason I care/have fear about information I will share freely getting absorbed into AI is...

I definitely believe that I could get sued for my own ideas. The chance of a lawyer claiming an AI says it is their idea sounds horrible.

There is a big potential role for open source or more specifically copyleft / free AI here that is released as a community project but can be monetized as well. The evidence from software is that there is lots of interest in contributing to such products.
I'm organizing the publishing of my thoughts on technology development and design. This has been something I have been mulling for decades. Completely uncompensated. I'm not doing any of this for reasons I can understand; it is just what I think about and do.

Originally I saw a website as a way to hang out a shingle. Until recently I was thinking I could just publish away and maybe someone would hire me based off the website.

Currently I don't feel the same way regarding publishing on the web. I will be more guarded in what I share.

> Even if it isn't monetizeable IP, how to share the costs?

The internet started thanks to ample government funding for research. So have many other technologies, including AI.

I wonder if there's a way we could all somehow pool our resources and use that to pay for common goods that we all use. What would we call such a scheme?

We could definitely fund things. Or tax LLM firms in some broad way to redistribute to the respective societies for using their creations.
> I wonder if there's a way we could all somehow pool our resources and use that to pay for common goods that we all use. What would we call such a scheme?

Is this tongue in cheek? I think it's called the government and taxes! :)

I want attribution if I inspire a thought in AI.

I'm surprised at the nerve the issue has struck in me.

It isn’t clear to me (other than it’s an open engineering problem) why LLMs couldn’t also include attribution as part of training. Also, tracking attribution could lead to some insights on how their internal representations in vector space are created.
That is true. I see no obvious reason why the LLM companies take pride in not being able to document the ideation process. I have no justification, but I feel the reasoning is deceitful, not technical.
The issue here is that memorization of any distinguishable part of IP is an incidental aspect - those models aren't memorizing stuff, they're learning it. We don't expect people to keep track of the source of every single piece of information they encounter. It would arguably make learning impossible - as much for humans as for LLMs.

As an intuition pump, when I write "2+2 = " and you mentally complete it with "4", should I chastise you for not completing it with "4, as per ${your elementary class math textbook} and ${that other book you read as a kid}, corroborated by ${your first math teacher} and ${your parent} quoting ${some other work}"?

What is the hard technical barrier that makes the tracking of attribution for input sequences for LLM training impossible? I don't see any.
When you make an omelette, what is the technical barrier making it practically impossible to tell which egg contributed how much to any given part of the meal?

It's roughly the same thing.

I have really enjoyed using https://phind.com, which includes attribution in its responses.
Phind’s base model, which is GPT3.5/4, doesn’t itself do attribution. It’s made to do that with prompt engineering, which provides the most relevant materials on the web based on a word embedding vector search, and then asks it to reference each source in the answer.
I mean, this is more or less what a student does when writing a paper, when they're forced to cite their sources. They first come up with an idea based on their own understanding/recollection, then they try to figure out where they first took that idea from. If they remember a specific source, they'll cite that; if they don't (because there may not be one specific source they learned from), they'll search for some existing work that expresses the idea in question, and cite that.
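A rough sketch of what that kind of retrieve-then-cite setup could look like. This is not Phind's actual code: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the `example.com` URLs and all function names are made up for illustration, and no LLM is called. It only shows how the attribution comes from the prompt rather than from the model itself.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, documents, k=2):
    """Return the k (url, text) pairs most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d[1])), reverse=True)
    return ranked[:k]

def build_prompt(query, sources):
    """Number the retrieved sources and ask the model to cite them."""
    lines = [
        "Answer the question using only the numbered sources below.",
        "Cite the source number, e.g. [1], after each claim.",
        "",
    ]
    for i, (url, text) in enumerate(sources, start=1):
        lines.append(f"[{i}] {url}: {text}")
    lines += ["", f"Question: {query}"]
    return "\n".join(lines)

# Hypothetical corpus; a real system would search the web instead.
DOCS = [
    ("https://example.com/trains", "How to take the night train from Vienna to Venice"),
    ("https://example.com/omelette", "A simple omelette recipe with three eggs"),
    ("https://example.com/hostels", "Cheap hostels in Venice near the train station"),
]

if __name__ == "__main__":
    question = "train to Venice"
    print(build_prompt(question, retrieve(question, DOCS)))
```

The citations the model then produces point at whatever the search step retrieved, not at whatever the model learned from during training.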

I.e. in case of both the student and an LLM, correct citation doesn't actually mean the idea originates from the cited work - only that the work contains this idea.

Thank you. I want to believe responsible development is happening. I just asked an LLM my first question and the interactive processing was great to watch.