by schwabacher 3202 days ago
Be careful w/ these urls - I visited one and it appears my IP is now blocked from accessing yelp.com.
5 comments

What do you suppose would happen if we all visited these URLs from every public access point we use? (edit: and shared VPNs, Tor, corporate networks, etc.)
That's really interesting. They might be trying to blacklist scrapers that don't properly respect robots.txt files.
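A rough sketch of how such a trap might work, purely as a guess at the mechanism (nothing here is confirmed to be what Yelp does). The trap paths, the `TrapBlocker` class, and the status codes are all hypothetical names made up for illustration:

```python
# Hypothetical honeypot: trap paths are listed under Disallow in robots.txt,
# so a well-behaved crawler never requests them.
TRAP_PATHS = {"/biz/trap-page-one", "/biz/trap-page-two"}  # made-up paths


class TrapBlocker:
    """Blocks any IP that requests one of the trap paths."""

    def __init__(self, trap_paths):
        self.trap_paths = set(trap_paths)
        self.blocked_ips = set()

    def handle(self, ip, path):
        """Return an HTTP-style status code for a request from `ip` to `path`."""
        if ip in self.blocked_ips:
            return 403  # already blacklisted
        if path in self.trap_paths:
            self.blocked_ips.add(ip)  # visiting a trap gets the IP blacklisted
            return 403
        return 200


blocker = TrapBlocker(TRAP_PATHS)
```

Once an IP touches a trap path, every later request from that IP is refused, which would match what the original poster described.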
Why would a scraper who doesn't respect robots.txt be accessing that file?
If I was building an unruly scraper (which, in case our new overlords are listening, I would never do), I would read robots.txt so that I had a clue where the secret information that the company did not want me to read was located.

I'm not allowed to look in /documents/source/? Perfect. Let's start there.
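To show how little effort this takes, here is a minimal sketch that pulls the Disallow entries out of a robots.txt body. The sample file and the `disallowed_paths` helper are hypothetical; only `/documents/source/` comes from the comment above, and `/admin/` is invented:

```python
# Made-up robots.txt contents for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /documents/source/
Disallow: /admin/   # hypothetical entry
Allow: /
"""


def disallowed_paths(robots_txt):
    """Return every non-empty Disallow path, in file order."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:  # an empty Disallow means "nothing is disallowed"
                paths.append(path)
    return paths


paths = disallowed_paths(SAMPLE_ROBOTS_TXT)
```

A polite crawler would treat the same list as paths to skip. The file is public either way, which is why listing a URL there hides nothing.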

They might be using it as a way to find specific pages which have content Yelp doesn't want you to scrape. The "evil" scenario.

They might also not be looking at the file at all, and just be appending random words to the end of Yelp's biz URLs to scrape every business. At some point such a scraper might hit one of these URLs by accident, since they are all words you'd find in other business URLs. This seems less likely, though.

I assume the GP was using their browser to access these links, though. Would they still be blacklisted by Yelp?
Scrapers can be pretty shady, so there isn't a good way to ensure that web traffic is coming from a legitimate human using a browser. To the server it's all just bits on the wire.
I just visited all of them: the IP was not blacklisted
The block worked on my laptop, but when I tried the same thing on my phone, I was not blocked (even when I requested the desktop version of the site).
Because they whitelist the exit IP for the phone network, else one person could block access for millions.
And what a great loss that is :-)
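The whitelisting idea above could look something like this. It is a sketch under assumptions: the network range is a documentation-reserved block standing in for a carrier's exit range, and `WHITELISTED_NETWORKS` and `may_block` are hypothetical names:

```python
import ipaddress

# Stand-in for shared carrier exit ranges. 198.51.100.0/24 is reserved for
# documentation and is not a real phone network.
WHITELISTED_NETWORKS = [ipaddress.ip_network("198.51.100.0/24")]


def may_block(ip):
    """Return True if `ip` is outside every whitelisted network."""
    addr = ipaddress.ip_address(ip)
    return not any(addr in net for net in WHITELISTED_NETWORKS)
```

A server would run this check before adding an address to its blacklist, so one person behind a shared exit IP could not lock out everyone else behind it.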