| HN Mirror



	by JohnFen 623 days ago
	Since the web was widely scraped to train LLMs, I have to assume that the entirety of what I had up on the web was included. That's more than a "minuscule pinch". I consider it to be wholesale abuse. For me, money doesn't enter into it at all. However, there's literally nothing I can do about it aside from withdrawing from the public web -- which is what I've done, aside from writing comments here. Until/unless there is some sort of effective way of defending against the crawlers, the open web is no longer a suitable place to publish anything.

2 comments

_heimdall 623 days ago

There's never going to be a way to defend against crawlers and still have an open web. Good actors may respect conventions like a robots.txt file but that's ultimately just a polite request.

You could get further trying to block by user agent headers, known crawler IPs, etc., but then you're just taking up the same fight advertisers have with ad blockers.
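For concreteness, here is a minimal sketch of what blocking by user agent and IP range could look like. The function name `is_blocked`, the token list, and the network list are all assumptions made for illustration; the IP range is a reserved documentation block, not a real crawler range.

```python
import ipaddress

# Hypothetical blocklists for illustration only; a real list would need
# constant upkeep as crawlers change names and addresses.
BLOCKED_AGENT_TOKENS = ("gptbot", "ccbot")
BLOCKED_NETWORKS = (
    ipaddress.ip_network("192.0.2.0/24"),  # placeholder: reserved documentation range
)


def is_blocked(user_agent: str, remote_ip: str) -> bool:
    """Return True if the request matches a blocked agent token or network."""
    agent = user_agent.lower()
    # Case-insensitive substring match against the User-Agent header.
    if any(token in agent for token in BLOCKED_AGENT_TOKENS):
        return True
    # Fall back to checking the client address against blocked networks.
    address = ipaddress.ip_address(remote_ip)
    return any(address in network for network in BLOCKED_NETWORKS)
```

A crawler that sends a browser-like user agent from an unlisted address passes both checks, which is why this only stops the actors who identify themselves.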


JohnFen 620 days ago

> You could get further trying to block by user agent headers

That's a game of whack-a-mole, though, and when the scraped data is being used to train LLMs, a single miss is a really huge problem. That's why I gave up on that approach and took my sites off the open web until some effective defense becomes possible.


Dylan16807 623 days ago

The complaints I see are almost always aimed at the output of an LLM, and that only contains a significant amount of a work when it breaks.

Going after the LLM itself, not the output, is a lot trickier. Anyone can make a big database of public website contents. And if they use it to make a search engine for example, that gets classified as entirely legitimate. If we're excluding the output of the LLM, what's the difference?

Also, if you scrunch the training data down into a small model, it mathematically can't contain very much of the input text.


JohnFen 620 days ago

> Going after the LLM itself, not the output, is a lot trickier.

Exactly so, and this is why withdrawing from the open web is the only realistic solution at this time.
