Hacker News
by Dylan16807 617 days ago
> I believe that the GP's complaint is that their content online is actually being scraped and turned into value for companies, they would want compensation for it.

And the comment directly addresses that. If someone creates a valuable thing and it has a minuscule pinch of your content inside it, you shouldn't be complaining or demanding payment. That's how participating in culture is supposed to work. When someone copies you orders of magnitude more directly, that's when you should be compensated or have control over it.


Since the web was widely scraped to train LLMs, I have to assume that the entirety of what I had up on the web was included. That's more than a "minuscule pinch". I consider it to be wholesale abuse. For me, money doesn't enter into it at all.

However, there's literally nothing I can do about it aside from withdrawing from the public web -- which is what I've done, aside from writing comments here. Until/unless there is some sort of effective way of defending against the crawlers, the open web is no longer a suitable place to publish anything.

There's never going to be a way to defend against crawlers and still have an open web. Good actors may respect conventions like a robots.txt file but that's ultimately just a polite request.
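To make the "polite request" point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `ExampleAIBot` name, and the `is_allowed` helper are all made up for illustration. The thing to notice is that the check runs inside the crawler's own code, so a crawler that never calls it is not stopped by anything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: asks one (made-up) crawler to stay out entirely.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots.txt permits this user agent to fetch the URL.

    This is the check a well-behaved crawler performs voluntarily before
    each request; the server has no way to force a crawler to run it.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed("ExampleAIBot", "https://example.com/post"))
    print(is_allowed("Mozilla/5.0", "https://example.com/post"))
```

A good actor skips the page when `is_allowed` returns False. A bad actor simply fetches the page anyway.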

You could get further trying to block by user agent headers, known crawler IPs, etc., but then you're just taking up the same fight advertisers have with ad blockers.
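A rough sketch of what that kind of blocking looks like on the server side, assuming a hand-maintained blocklist. The agent names and the `should_block` helper are hypothetical, and the network is a range reserved for documentation. It also shows the weakness: a crawler that sends a browser-like user agent from an unlisted address passes straight through.

```python
import ipaddress

# Hypothetical blocklists; real crawler names and address ranges change
# over time and would have to be kept up to date by hand.
BLOCKED_AGENT_SUBSTRINGS = ("exampleaibot", "examplescraper")
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),  # TEST-NET-3, documentation-only range
]


def should_block(user_agent: str, remote_ip: str) -> bool:
    """Return True if the request matches a blocked user agent or network."""
    ua = user_agent.lower()
    if any(fragment in ua for fragment in BLOCKED_AGENT_SUBSTRINGS):
        return True
    addr = ipaddress.ip_address(remote_ip)
    return any(addr in network for network in BLOCKED_NETWORKS)


if __name__ == "__main__":
    # Honest crawler: caught by its user agent.
    print(should_block("ExampleAIBot/1.0", "198.51.100.7"))
    # Spoofed user agent but listed address: caught by IP.
    print(should_block("Mozilla/5.0", "203.0.113.9"))
    # Spoofed user agent from an unlisted address: not caught.
    print(should_block("Mozilla/5.0", "198.51.100.7"))
```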

> You could get further trying to block by user agent headers

That's a game of whack-a-mole, though, and when the scraped data is being used to train LLMs, then a single miss is a really huge problem. That's why I gave up on that approach and took my sites off of the open web until some effective defense becomes possible.

The complaints I see are almost always aimed at the output of an LLM, and that only contains a significant amount of a work when it breaks.

Going after the LLM itself, not the output, is a lot trickier. Anyone can make a big database of public website contents. And if they use it to make a search engine, for example, that gets classified as entirely legitimate. If we're excluding the output of the LLM, what's the difference?

Also, if you scrunch it down into a small model, it mathematically can't contain very much of the input text.
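A back-of-the-envelope version of that argument, with made-up numbers: a model's weights take up a fixed number of bytes, so the ratio of weight size to corpus size gives a rough ceiling on how much of the training text could be stored. The `max_retained_fraction` helper, the parameter count, and the corpus size are all assumptions for illustration, not figures for any real model. Since text compresses, treat the result as an order-of-magnitude guide rather than a strict bound.

```python
def max_retained_fraction(n_params: int, bits_per_param: int, corpus_bytes: int) -> float:
    """Ratio of raw weight storage to corpus size, capped at 1.0.

    Ignores compression and everything else the weights need to encode,
    so it only indicates scale.
    """
    capacity_bytes = n_params * bits_per_param / 8
    return min(1.0, capacity_bytes / corpus_bytes)


if __name__ == "__main__":
    # Hypothetical: 1 billion parameters at 16 bits each, 10 TB of training text.
    fraction = max_retained_fraction(1_000_000_000, 16, 10**13)
    print(f"{fraction:.4%} of the corpus, at most, by raw size")
```

With these assumed numbers the weights total 2 GB against a 10 TB corpus, so by raw size they could hold about 0.02% of it.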

> Going after the LLM itself, not the output, is a lot trickier.

Exactly so, and this is why withdrawing from the open web is the only realistic solution at this time.

That's a totally reasonable take, though it is just one opinion. I wouldn't tell someone they can't complain or feel entitled to payment for the value they created, though I bet we both agree that posting publicly online offers no expectation of payment by anyone coming across your content.