| HN Mirror



	by AstroBen 272 days ago
	They torrented a shitload of books illegally and trained on them.. but they're unable to get past The Great Wall of robots.txt?

4 comments

blibble 272 days ago

I doubt it's their only countermeasure

plus it's a pretty dangerous game for them to play against large, powerful actors with legions of lawyers


selcuka 272 days ago

> plus it's a pretty dangerous game for them to play against large, powerful actors with legions of lawyers

Like book publishers?


Gigachad 272 days ago

If the AI crawlers circumvent the protection mechanisms, it's a serious crime now rather than just "Well it was on the open internet for free". Wouldn't surprise me if the news orgs are also looking at honeypot articles to see if the fake details slip into LLMs.


crazygringo 272 days ago

It's not a serious crime, or any crime at all, to ignore robots.txt. It's entirely voluntary whether you want to follow it or not. If you don't, you're being a dick maybe, but that's not a crime.


Gigachad 272 days ago

It's not just robots.txt. If you've tried using a VPN lately, so many sites like reddit/youtube/etc block you from viewing content until you log in. Every major website has been adding anti-scraping tech over the last year. Even archive.org is getting blocked by more and more sites, since it can be used for indirect scraping.


simianwords 272 days ago

robots.txt prevents real-time search use for grounding and citations.


crazygringo 272 days ago

No it doesn't. It has zero legal force. Or any technical force either.


simianwords 272 days ago

Not an expert so I ask: no technical force either? Is it just a polite ask then?


zdragnar 272 days ago

It's hardly even a polite ask. It's literally a text file. Automated HTTP clients, such as search engine indexers (Google, Yahoo, etc.), are expected to read it to learn which pages they may or may not visit. That expectation is nothing more than a convention.

If you are on a Mac or Linux computer, odds are it has a program called curl pre-installed. If you type curl followed by a website address in a terminal, it'll make a request and download the response. robots.txt never gets involved. The same is true for AI agents and search engines that aren't polite.
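
To make the "convention" point concrete, here is a minimal sketch of what a polite client does, using Python's standard-library urllib.robotparser. The robots.txt contents, the ExampleBot user agent, the example.com URLs, and the is_allowed helper are all made up for illustration. The check only happens because the client chooses to run it; nothing on the server side enforces the answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A polite crawler asks first and skips URLs where the answer is False.
    for url in ("https://example.com/private/page",
                "https://example.com/public/page"):
        print(url, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

A client that never calls anything like is_allowed, which is what a bare curl request amounts to, just sends the request and gets the response.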


euLh7SM5HDFY 272 days ago

LinkedIn lost their anti-scraping suit: https://www.forbes.com/sites/zacharysmith/2022/04/18/scrapin... but it seems that since then they were able to successfully appeal that decision.

Regardless - requiring an account to read anything, even a "free" one, totally changes the whole situation. Even when a site's terms of service are limited by local law.


crazygringo 272 days ago

Correct. Literally just a polite ask.


strangattractor 272 days ago

Great Firewall, actually. robots.txt depended on the integrity of the companies crawling. I think they have demonstrated how much integrity they actually have :)
