Hacker News
by hutrdvnj 1365 days ago
I have used Googlebot as my browser's fake user agent for years. It's really interesting to explore the web when everyone thinks you're Google.
5 comments

Unless the originating IP address is a Google-controlled one, using Googlebot as a User-Agent header is (IME) generally no better than not sending a UA header at all.^1 If the goal is to make a server believe a request is coming from Google, then the request needs to be sent from a publicised Google-controlled IP address.^2

1. For many years I have had great results with not sending a UA header. It is also, IMO, an effective means to discover the true number of websites that refuse to fulfill a request in the absence of a UA header, which IME is extremely small. For that small handful of sites, one can send a "fake" UA header of one's choosing. sec.gov is an example of such a site.

2. http://developers.google.com/static/search/apis/ipranges/goo...
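
For illustration, the IP check described above might look like the sketch below. The JSON layout (a `prefixes` list with `ipv4Prefix` / `ipv6Prefix` keys) is an assumption about how a published range file is shaped, the prefixes are documentation-only placeholders rather than real Google ranges, and `load_networks` / `ip_in_ranges` are hypothetical helper names.

```python
import ipaddress
import json

# Placeholder data shaped like a published range file (assumed layout).
# These are documentation-only prefixes, NOT real Google ranges.
SAMPLE_RANGES = json.loads("""
{
  "prefixes": [
    {"ipv4Prefix": "192.0.2.0/24"},
    {"ipv6Prefix": "2001:db8::/32"}
  ]
}
""")


def load_networks(ranges):
    """Turn the 'prefixes' entries into ip_network objects."""
    networks = []
    for entry in ranges.get("prefixes", []):
        prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if prefix:
            networks.append(ipaddress.ip_network(prefix))
    return networks


def ip_in_ranges(ip, networks):
    """Return True if ip falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    # Compare only against networks of the same IP version.
    return any(addr.version == net.version and addr in net for net in networks)
```

A server would call `ip_in_ranges(client_ip, load_networks(ranges))` for any request whose UA claims to be Googlebot, and treat a miss as a spoofed UA.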

Interestingly, lite.duckduckgo.com recently started requiring a User-Agent header, after many years of operating without this requirement. Are there any enforceable limits on what DDG can do with the UA header data? There has been no update to DDG's privacy policy.
I wonder if fake-bot detectors can distinguish between any Google IP, like GCP instances (i.e. do they simply check the ASN), and crawler-specific IPs.

Or maybe Google's crawler also runs on GCP and is indistinguishable from regular $5 compute users.

Yes, Google and most major search engines support a reverse DNS (rDNS) lookup so you can validate that a request really comes from Googlebot (or their own crawler).
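
A rough sketch of that rDNS validation: look up the PTR name, check its domain, then resolve the name forward and confirm it maps back to the same IP. The suffix list is an assumption (check the search engine's own documentation for the current one), and `verify_crawler_ip` is a hypothetical helper. The resolvers are passed in as parameters only so the logic can be exercised without network access.

```python
import socket

# Hostname suffixes assumed to identify Google crawlers (an assumption;
# consult the search engine's documentation for the authoritative list).
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")


def verify_crawler_ip(ip,
                      reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex,
                      suffixes=TRUSTED_SUFFIXES):
    """Forward-confirmed reverse DNS check for a claimed crawler IP."""
    try:
        # Step 1: reverse lookup (PTR record) gives the claimed hostname.
        hostname = reverse(ip)[0]
    except OSError:
        return False
    # Step 2: the hostname must end in one of the trusted domains.
    if not hostname.lower().rstrip(".").endswith(suffixes):
        return False
    try:
        # Step 3: forward lookup must map back to the original IP.
        # Note: gethostbyname_ex only returns IPv4 addresses.
        addresses = forward(hostname)[2]
    except OSError:
        return False
    return ip in addresses
```

The forward step matters because anyone who controls the reverse zone for their own IP block can set a PTR record to any name they like.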
Like the OP, I’ve employed a custom configuration in Cloudflare which detects (and blocks) browsers which claim to be Googlebot but don’t originate from Google’s approved Googlebot IP ranges.

The vast majority of such requests are dodgy scanning operations likely looking for email addresses or exploitable forms.

What are some of the most interesting differences you've seen?
I think the best thing is that on some sites that have many ads, subscriber-only content, cookie-consent pop-ups, and/or captchas, you will sometimes find that these are all gone and you get an ad-free, full-text version. But that's only the case for websites that do not check the origin IP and just rely on the user agent.
It is indeed interesting. Some sites even let you view their content without JS or registering/subscribing, and/or revert to something approaching an unstyled static site without showing any ads, sidebars, or other useless content.

To add to some of the other experiences here about no-UA: I've tried that before too, and it was notably worse than pretending to be Google; lots of sites just return "Internal Server Error" or similar messages.
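
For anyone who wants to compare the two approaches themselves, here is a minimal sketch that builds the raw request by hand, so no library slips in a default header (Python's urllib, for example, sends its own `Python-urllib/x.y` UA unless told otherwise). `build_get_request` is a hypothetical helper, and the UA string is the commonly documented Googlebot one.

```python
# Commonly documented Googlebot user agent string.
GOOGLEBOT_UA = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)


def build_get_request(host, path="/", user_agent=None):
    """Build a raw HTTP/1.1 GET request; omit User-Agent when None."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",
    ]
    if user_agent is not None:
        lines.append(f"User-Agent: {user_agent}")
    # Header block ends with an empty line (CRLF CRLF).
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")
```

The resulting bytes can be written to a socket from `socket.create_connection((host, 80))`; HTTPS sites would additionally need the socket wrapped with the `ssl` module.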

Do websites not spit at you, or do they just assume you 'will do no evil'?