by johnc1231 1761 days ago
How do they identify a crawler? Can a user not masquerade as a crawler somehow?
2 comments

Sometimes they check the User-Agent header for this, but that is easily faked, so it's only done by websites that don't have comprehensive auth walls.
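To make the weakness concrete, here is a rough sketch of what a user-agent check might look like on the server side. The pattern list and the function name are made up for illustration; real sites will have their own lists.

```python
import re

# Hypothetical pattern: substrings a site might look for in the User-Agent header.
CRAWLER_UA_PATTERN = re.compile(r"googlebot|bingbot", re.IGNORECASE)


def looks_like_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent string claims to be a known crawler."""
    # The header is supplied by the client, so this only tests what the
    # client *says* it is, not what it actually is.
    return bool(CRAWLER_UA_PATTERN.search(user_agent or ""))
```

Since the client controls the header, any browser or script that sends a crawler's UA string passes this check.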

A more robust method is based on IP ranges, say whitelisting traffic from Google and Bing. This gives you >95% of search traffic, as Google alone has >90% and many smaller search engines like Yahoo or DDG are Bing resellers.
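A minimal sketch of an IP range check, assuming the site keeps a list of crawler networks somewhere. The CIDR blocks below are placeholders from the reserved documentation ranges, not real Google or Bing ranges, and the names are hypothetical.

```python
import ipaddress

# Placeholder networks (RFC 5737 documentation ranges), NOT real crawler ranges.
# A real deployment would load the ranges the search engines actually use.
ALLOWED_CRAWLER_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_allowlisted_crawler_ip(remote_addr: str) -> bool:
    """Return True if the connecting address falls inside an allowed network."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Not a parseable IP address: treat as a normal visitor.
        return False
    return any(addr in network for network in ALLOWED_CRAWLER_NETWORKS)
```

Unlike the UA string, the source address of a TCP connection is hard for an ordinary visitor to forge, which is why this is the stronger of the two checks.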

On the other hand, a pure IP-based check can be circumvented too. Sometimes you can view how search engines see a website through Google Translate, since the page is then fetched from Google's side rather than from your own IP. But places like LinkedIn have countermeasures for all of these circumventions.

Normally by user agent. You can adopt the same UA to get past some paywalls, but it doesn't always work.