| HN Mirror



	by 013a 3202 days ago
	That's really interesting. They might be trying to blacklist scrapers that don't properly respect robots.txt files.

2 comments

tossaway1 3202 days ago

Why would a scraper that doesn't respect robots.txt be accessing that file?


beobab 3202 days ago

If I was building an unruly scraper (which, in case our new overlords are listening, I would never do), I would read robots.txt so that I had a clue where the secret information that the company did not want me to read was located.

I'm not allowed to look in /documents/source/? Perfect. Let's start there.
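To make that concrete, here is a rough sketch of how little work it takes to read the disallowed paths out of a robots.txt file, next to the check a polite crawler would run instead. The file content, the `ExampleBot` user agent, the `example.com` URLs, and the `disallowed_paths` helper are all made up for illustration; only `urllib.robotparser` is real (Python standard library).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, using the path from the comment above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /documents/source/
"""


def disallowed_paths(robots_txt: str) -> list[str]:
    """Return every path named in a Disallow line, in file order.

    Illustrative helper: it ignores which User-agent group a rule belongs to.
    """
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        key, _, value = line.partition(":")
        if key.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# What anyone can see just by reading the file: the "keep out" list is public.
hidden = disallowed_paths(ROBOTS_TXT)

# What a well-behaved crawler does: ask before fetching, then stay away.
parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())
may_fetch_secret = parser.can_fetch(
    "ExampleBot", "https://example.com/documents/source/notes.txt"
)
may_fetch_public = parser.can_fetch("ExampleBot", "https://example.com/biz/some-cafe")
```

The same file serves both readers, which is why robots.txt only works as a request and not as access control.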


013a 3202 days ago

They might be using it as a way to find specific pages which have content Yelp doesn't want you to scrape. The "evil" scenario.

They might also not be looking at the file at all, and just appending random words to the end of Yelp's biz URLs to scrape every business. At some point that approach might hit that URL by accident, since it is made up of words you'd find in other business URLs. This seems less likely, though.
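For what it's worth, the blacklisting idea is simple to sketch. This is a guess at the general technique, not anything known about how Yelp does it: list a trap path under Disallow in robots.txt, then block whoever requests it. The trap path, the `HoneypotBlocklist` class, and keying on client IP are all assumptions made for the example.

```python
# Hypothetical trap path; it would be listed under Disallow in robots.txt
# and linked nowhere a normal visitor would click.
TRAP_PATHS = {"/trap/do-not-crawl"}


class HoneypotBlocklist:
    """Block any client that requests a path robots.txt told it to avoid."""

    def __init__(self, trap_paths):
        self.trap_paths = set(trap_paths)
        self.blocked = set()  # client IPs that have hit a trap path

    def handle(self, client_ip: str, path: str) -> int:
        """Return the HTTP status code this request would receive."""
        if client_ip in self.blocked:
            return 403  # already flagged: refuse everything
        if path in self.trap_paths:
            self.blocked.add(client_ip)  # flag on first trap hit
            return 403
        return 200


blocklist = HoneypotBlocklist(TRAP_PATHS)
first = blocklist.handle("203.0.113.7", "/biz/some-cafe")       # normal page
trap = blocklist.handle("203.0.113.7", "/trap/do-not-crawl")    # trap hit
after = blocklist.handle("203.0.113.7", "/biz/some-cafe")       # now refused
other = blocklist.handle("198.51.100.2", "/biz/some-cafe")      # unaffected client
```

Note that the check only sees an IP and a path, so it flags a scraper that guessed the URL by accident, or a person who pasted it into a browser, exactly the same way.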


a012 3202 days ago

I assume the GP was using their browser to access these links. Would they still be blacklisted by Yelp?


wickawic 3202 days ago

Scrapers can be pretty shady, so there isn't a good way to ensure that web traffic is coming from a legitimate human using a browser. To the server it's all just bits on the wire.
