by JohnFen 1041 days ago
> blocking GPTBot will not guarantee that a site's data does not end up training all AI models of the future. Aside from issues of scrapers ignoring robots.txt files, there are other large data sets of scraped websites (such as The Pile) that are not affiliated with OpenAI.

This is why I'm not reassured. robots.txt isn't sufficient to stop all webcrawlers, so there is every reason to think it isn't sufficient to stop AI scrapers.

I'm still wanting to find a good solution to this problem so that I can open my sites up to the public again.

3 comments

I think bots are part of the public.
OK, then pretend that I said "open my site up to the human public", instead.
There's never going to be a perfect solution, it's an arms race. I really doubt (hope?) that large entities are going to straight up emulate end-user browsers though.

I would think filtering based on user agent will be the sweet spot for effort and performance. You could do some awful JavaScript monstrosity to detect the tiny fraction of bots who are sneaky, but if they're determined to be sneaky they will succeed at scraping.

User agent matching isn't good enough. The stakes are high -- all it takes is one AI crawler to grab my site data, and that data is included in the training set forevermore.

> if they're determined to be sneaky they will succeed at scraping.

Yes, which is why I suspect I will never be able to open my websites up to the general public again. I live in hope anyway.

Browsers aren't really trusted platforms; the cool scraping is in emulating phones, whether by actually running a virtual phone or by sending traffic that emulates one.

That really just encourages phones to be even more locked down.

I chose to use an nginx entry, because I also don't trust them to follow robots.txt. Throwing a 410 Gone should, in theory, keep them from coming back too, assuming they actually drop the page when they receive it, like they should.

`if ($http_user_agent ~* ".*?(GPTBot|AI).*?") { return 410; }`

It's not perfect, but it should filter them indefinitely. I'll probably have to add some more terms in there over time.
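
For anyone adapting the rule above, here is a rough sketch of a slightly stricter variant, not a drop-in config. One thing to watch: with the case-insensitive `~*` operator, a bare `AI` term matches any user agent that merely contains the letters "ai" (Baiduspider, for example), so it will probably catch more than intended. The sketch lists specific crawler tokens instead. `$block_ai_crawler` is a made-up variable name, `CCBot` is an assumed extra entry that is not from this thread, and `example.com` is a placeholder.

```nginx
# Goes in the http {} context.
# $block_ai_crawler is a hypothetical variable name chosen for this sketch.
map $http_user_agent $block_ai_crawler {
    default 0;
    # Case-insensitive match on specific crawler tokens only.
    # GPTBot is from the thread; CCBot is an assumed addition.
    "~*(GPTBot|CCBot)" 1;
}

server {
    listen 80;
    server_name example.com;  # placeholder

    # Answer 410 Gone when the User-Agent matched one of the tokens above.
    if ($block_ai_crawler) {
        return 410;
    }
}
```

Keeping the list in a `map` means new terms get added in one place over time. It still relies entirely on the crawler sending an honest User-Agent header.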

That's relying on the user agent, though. That's not a trustworthy enough signal for me. For one, crawlers can use any user agent string they like. For another, I don't know what all the possible user agent strings are.