Hacker News
by JohnFen 623 days ago
Since the web was widely scraped to train LLMs, I have to assume that the entirety of what I had up on the web was included. That's more than a "miniscule pinch". I consider it to be wholesale abuse. For me, money doesn't enter into it at all.

However, there's literally nothing I can do about it aside from withdrawing from the public web -- which is what I've done, aside from writing comments here. Until/unless there is some sort of effective way of defending against the crawlers, the open web is no longer a suitable place to publish anything.

2 comments

There's never going to be a way to defend against crawlers and still have an open web. Good actors may respect conventions like a robots.txt file, but that's ultimately just a polite request.
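To make the "polite request" point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `may_fetch` helper, and the example.com URL are assumptions for illustration; `GPTBot` is used only as an example crawler token.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: asks one named crawler to stay out, allows everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Report what robots.txt asks of this user agent (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(may_fetch("GPTBot", "https://example.com/post"))       # the file asks this agent not to fetch
print(may_fetch("Mozilla/5.0", "https://example.com/post"))  # no rule against this agent
```

Nothing here stops a request. The check only runs if the crawler chooses to run it, and a crawler that skips it still gets the page.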

You could get further trying to block by user agent headers, known crawler IPs, etc., but then you're just taking up the same fight advertisers have with ad blockers.
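A rough sketch of what blocking by user agent amounts to, assuming a simple substring blocklist. The `is_blocked` helper and the token list are illustrative, not a complete or recommended list.

```python
# Illustrative blocklist; these tokens are examples, not an exhaustive list.
BLOCKED_AGENT_TOKENS = ("GPTBot", "CCBot")

def is_blocked(user_agent: str, tokens=BLOCKED_AGENT_TOKENS) -> bool:
    """Return True if the User-Agent header contains a blocked token (case-insensitive)."""
    ua = (user_agent or "").lower()
    return any(token.lower() in ua for token in tokens)

print(is_blocked("Mozilla/5.0 (compatible; GPTBot/1.0)"))  # matches a blocked token
print(is_blocked("Mozilla/5.0 (X11; Linux x86_64)"))       # looks like a browser, passes
```

The header is whatever the client says it is, so a crawler that sends a browser-like string passes this filter untouched.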

> You could get further trying to block by user agent headers

That's a game of whack-a-mole, though, and when the scraped data is being used to train LLMs, then a single miss is a really huge problem. That's why I gave up on that approach and took my sites off of the open web until some effective defense becomes possible.

The complaints I see are almost always aimed at the output of an LLM, and that output only contains a significant amount of a work when the model breaks.

Going after the LLM itself, not the output, is a lot trickier. Anyone can make a big database of public website contents. And if they use it to make a search engine, for example, that gets classified as entirely legitimate. If we're excluding the output of the LLM, what's the difference?

Also, if you scrunch the data down into a small model, it mathematically can't contain very much of the input text.
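A back-of-the-envelope version of that size argument, as a sketch. All three numbers are hypothetical placeholders, not figures for any real model or dataset, and `capacity_ratio` is a made-up helper.

```python
def capacity_ratio(n_params: float, bytes_per_param: float, corpus_bytes: float) -> float:
    """Model size in bytes as a fraction of the training text size in bytes."""
    return (n_params * bytes_per_param) / corpus_bytes

# Hypothetical: 7 billion parameters at 2 bytes each, trained on 10 TB of text.
ratio = capacity_ratio(n_params=7e9, bytes_per_param=2, corpus_bytes=10e12)
print(f"{ratio:.2%}")
```

With those assumed numbers the weights come to about 0.14% of the size of the text. That bounds how much the model can store in total; it says nothing about which particular passages end up memorized.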

> Going after the LLM itself, not the output, is a lot trickier.

Exactly so, and this is why withdrawing from the open web is the only realistic solution at this time.