Hacker News
by AstroBen 272 days ago
They torrented a shitload of books illegally and trained on them.. but they're unable to get past The Great Wall of robots.txt?
4 comments

I doubt it's their only countermeasure

plus it's a pretty dangerous game for them to play against large, powerful actors with legions of lawyers

> plus it's a pretty dangerous game for them to play against large, powerful actors with legions of lawyers

Like book publishers?

If the AI crawlers circumvent the protection mechanisms, it's a serious crime now rather than just "Well it was on the open internet for free". Wouldn't surprise me if the news orgs are also looking at honeypot articles to see if the fake details slip into LLMs.
It's not a serious crime, or any crime at all, to ignore robots.txt. It's entirely voluntary whether you want to follow it or not. If you don't, you're being a dick maybe, but that's not a crime.
It's not just robots.txt. If you've tried using a VPN lately, so many sites like Reddit/YouTube/etc. block you from viewing content until you log in. Every major website has been adding anti-scraping tech in the last year. Even archive.org is getting blocked by more and more sites, since it can be used for indirect scraping of sites.
robots.txt prevents real time search use for grounding and citations.
No it doesn't. It has zero legal force. Or any technical force either.
Not an expert so I ask: no technical force either? Is it just a polite ask then?
It's hardly even a polite ask. It's literally a text file. Automated HTTP clients, such as search engine indexers (Google, Yahoo, etc.), are expected to use it to know which pages can be visited and which can't. That expectation is nothing more than a convention.

If you are on a Mac or Linux computer, odds are it has a program called curl pre-installed. If you type curl followed by a website address in a terminal, it'll make a request and download the response. robots.txt never gets involved. Same is true for AI agents and search engines that aren't polite.
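To make the "convention" point concrete, here is a minimal sketch in Python of what a polite client does. The robots.txt content, the bot names, and the example.com URLs are all made up for illustration. The check runs entirely on the client side, using the standard library's urllib.robotparser; a client that skips the call just fetches the page, exactly like the curl case above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that the client has already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if a well-behaved client would consider the URL allowed.

    Nothing on the server enforces this answer. It only matters if the
    client chooses to ask and chooses to obey.
    """
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# The bot named in the file is asked to stay away from everything.
print(polite_may_fetch(parser, "ExampleAIBot", "https://example.com/articles/1"))  # False

# Any other bot falls under the "*" rules: articles are fine, /private/ is not.
print(polite_may_fetch(parser, "SomeOtherBot", "https://example.com/articles/1"))  # True
print(polite_may_fetch(parser, "SomeOtherBot", "https://example.com/private/x"))   # False
```

An impolite crawler simply never calls polite_may_fetch, or ignores a False result. The file has no way to stop the request.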

LinkedIn lost their anti-scraping suit: https://www.forbes.com/sites/zacharysmith/2022/04/18/scrapin... but it seems since then they were able to successfully appeal that decision.

Regardless, requiring an account to read anything, even a "free" one, totally changes the whole situation. Even when a site's terms of service are limited by local law.

Correct. Literally just a polite ask.
Great Firewall, actually. robots.txt depended on the integrity of the companies crawling. I think they have demonstrated how much integrity they actually have :)