by 013a 3202 days ago
That's really interesting. They might be trying to blacklist scrapers that don't properly respect robots.txt files.

Why would a scraper that doesn't respect robots.txt be accessing that file?
If I were building an unruly scraper (which, in case our new overlords are listening, I would never do), I would read robots.txt to find out where the company keeps the information it doesn't want me to read.

I'm not allowed to look in /documents/source/? Perfect. Let's start there.
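For what it's worth, pulling that list out of the file takes only a few lines, which is why robots.txt offers no secrecy. A minimal sketch in Python; the sample file contents (including the /admin/ and /public/ entries) are made up for illustration, and the parsing is deliberately naive (it ignores which User-agent group a rule belongs to):

```python
def disallowed_paths(robots_txt: str) -> list[str]:
    """Return every non-empty Disallow value found in a robots.txt body."""
    paths = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        # Field names are case-insensitive; an empty Disallow means "allow all".
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# Hypothetical file contents, not taken from any real site.
SAMPLE = """\
User-agent: *
Disallow: /documents/source/
Disallow: /admin/  # staff only
Allow: /public/
"""

print(disallowed_paths(SAMPLE))
```

Anyone can do this, so whatever is listed under Disallow should be assumed public knowledge.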

They might be using it as a way to find specific pages which have content Yelp doesn't want you to scrape. The "evil" scenario.

They might also not be looking at the file at all, and just appending random words to the end of Yelp's biz URLs to scrape every business. At some point such a scraper might hit that URL by accident, since it's made up of words you'd find in other business URLs. This seems less likely, though.

I assume the GP was using their browser to access these links. Would they still be blacklisted by Yelp?
Scrapers can be pretty shady, so there isn't a good way to ensure that web traffic is coming from a legitimate human using a browser. To the server it's all just bits on the wire.
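
To make the trap idea from upthread concrete, here is a rough sketch of how a server could flag any client that requests a path its own robots.txt disallows. This is a guess at the general technique, not Yelp's actual setup: the trap path /biz/trap-page, the IP addresses, and the record_request helper are all hypothetical. It uses Python's standard urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with a single trap entry.
ROBOTS_TXT = """\
User-agent: *
Disallow: /biz/trap-page
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Clients that have requested a disallowed path at least once.
flagged: set[str] = set()


def record_request(client_ip: str, path: str, user_agent: str = "*") -> bool:
    """Flag the client if it asks for a disallowed path; return its flag status."""
    if not parser.can_fetch(user_agent, path):
        flagged.add(client_ip)
    return client_ip in flagged


print(record_request("203.0.113.7", "/biz/some-cafe"))   # normal page
print(record_request("203.0.113.7", "/biz/trap-page"))   # trap page
print(record_request("203.0.113.7", "/biz/some-cafe"))   # same client again
```

Note that the check only sees an IP and a path. A person who pastes the trap URL into a browser gets flagged exactly like a bot would, which is the "bits on the wire" problem.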