by kelnos 228 days ago

> robots.txt is a polite request to please not scrape these pages

People who ignore polite requests are assholes, and we are well within our rights to complain about them.

I agree that "theft" is too strong (though I think you might be presenting a straw man there), but "abuse" can be perfectly apt: a crawler hammering a server, requesting the same pages over and over, absolutely is abuse.

> Likewise if enforcing a rule of no scraping is of utmost importance you need to require an API token or some other form of authentication before you serve the pages.

That's a shitty world that we shouldn't have to live in.

2 comments

wslh 228 days ago

> People who ignore polite requests are assholes, and we are well within our rights to complain about them.

If you are building a new search engine and the robots.txt only includes Google, are you an asshole for indexing the information?


kijin 228 days ago

Yes, because the site owner has clearly and explicitly requested that you don't scrape their site, fully accepting the consequence that their site will not appear in any search engine other than Google.

Whatever impact your new search engine or LLM might have in the world is irrelevant to their wishes.
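For concreteness, here is a rough sketch of the scenario being discussed, assuming that "only includes Google" means a robots.txt that allows Googlebot and disallows everyone else. The file contents, the `NewSearchBot` name, the `may_fetch` helper, and the example.com URL are all made up for illustration; the parsing itself uses Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot may crawl everything, all others nothing.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines; no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/some/page"  # placeholder URL
    print("Googlebot:", may_fetch("Googlebot", page))
    print("NewSearchBot:", may_fetch("NewSearchBot", page))
```

Under that assumed file, a well-behaved crawler for a new search engine would get a "no" from this check before requesting anything, which is the request the site owner is making.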


DoctorOetker 227 days ago

Whenever one forms a sentence, it is worthwhile to try to form a sentence that you believe to be generally true.

If someone politely requests you to suck their genitalia, and you ignore that request, does that make you an asshole?
