| HN Mirror


	by hutrdvnj 1412 days ago
	I have used Googlebot as my fake browser user agent for years. It's really interesting to explore the web when everyone thinks you're Google.

5 comments

1vuio0pswjnm7 1412 days ago

Unless the originating IP address is a Google-controlled one, using Googlebot as a User-Agent header is (IME) generally no better than not sending a UA header at all.^1 If the goal is to make a server believe a request is coming from Google, then the request needs to be sent from a publicised Google-controlled IP address.^2

1. For many years I have had great results with not sending a UA header. It is also, IMO, an effective means to discover the true number of websites that refuse to fulfill a request in the absence of a UA header, which IME is extremely small. For that small handful of sites, one can send a "fake" UA header of one's choosing. sec.gov is an example of such a site.

2. http://developers.google.com/static/search/apis/ipranges/goo...
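The approach in footnote 1 (send no UA header, and fall back to a UA of one's choosing only for sites that refuse the request) could be sketched roughly as follows. This is a minimal sketch using Python's standard library, not the commenter's actual tooling; the function names and the choice to treat any HTTP error status as "refused" are assumptions made for illustration.

```python
import urllib.error
import urllib.request


def open_url(url, user_agent=None, timeout=10.0):
    """Open url, sending a User-Agent header only if one is given."""
    opener = urllib.request.build_opener()
    # urllib adds "Python-urllib/x.y" by default; replacing addheaders
    # removes that default so no User-Agent header is sent at all.
    opener.addheaders = [("User-Agent", user_agent)] if user_agent else []
    return opener.open(url, timeout=timeout)


def fetch_preferring_no_ua(url, fallback_ua, timeout=10.0):
    """Try without a UA header; retry once with fallback_ua on an HTTP error.

    Returns (body_bytes, used_fallback).
    Assumption: any HTTP error status means the missing UA was the cause.
    """
    try:
        with open_url(url, None, timeout) as resp:
            return resp.read(), False
    except urllib.error.HTTPError:
        with open_url(url, fallback_ua, timeout) as resp:
            return resp.read(), True
```

Counting how often `used_fallback` comes back `True` over a list of sites would give a rough measure of how many servers actually require the header.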

1vuio0pswjnm7 1411 days ago

Interestingly, lite.duckduckgo.com recently started requiring a User-Agent header, after many years of operating without this requirement. Are there any enforceable limits on what DDG can do with the UA header data? There has been no update to DDG's privacy policy.

trinovantes 1412 days ago

I wonder if fake-bot detectors can distinguish between any Google IP, such as GCP instances (i.e. do they simply check the ASN?), and crawler-specific IPs.

Or maybe Google's crawler also runs on GCP and is indistinguishable from regular $5 compute users.

windowsworkstoo 1412 days ago

Yes, Google and most major search engines support a reverse DNS (rDNS) lookup so that sites can validate a request really comes from their crawler, e.g. Googlebot.
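For readers unfamiliar with the rDNS check, a minimal sketch follows: look up the PTR name for the client IP, check that it falls under an expected crawler domain, then resolve that name forward and confirm it maps back to the same IP. The domain suffixes below are an assumption based on my recollection of Google's published guidance and should be checked against the current documentation; the function names and the injectable lookup parameters are hypothetical, added so the logic can be exercised without real DNS.

```python
import ipaddress
import socket

# Assumption: domains a genuine Googlebot PTR record is expected to end in.
CRAWLER_DOMAINS = (".googlebot.com", ".google.com")


def reverse_name(ip):
    """Return the PTR hostname for ip (raises OSError on failure)."""
    return socket.gethostbyaddr(ip)[0]


def forward_addresses(hostname):
    """Return the set of address strings that hostname resolves to."""
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def is_verified_crawler(ip, reverse=reverse_name, forward=forward_addresses):
    """Forward-confirmed reverse DNS check for a claimed crawler IP."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:
        return False  # no PTR record: cannot be verified
    if not hostname.endswith(CRAWLER_DOMAINS):
        return False  # PTR name is outside the expected domains
    try:
        resolved = forward(hostname)
    except OSError:
        return False
    # The forward lookup must map back to the original IP; a PTR record
    # alone is controlled by whoever owns the IP block and can be faked.
    wanted = ipaddress.ip_address(ip)
    return any(ipaddress.ip_address(addr) == wanted for addr in resolved)
```

The forward-confirmation step is what makes the check meaningful, since anyone who controls reverse DNS for their own address space can publish a PTR record that merely looks like a crawler hostname.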

trinovantes 1411 days ago

TIL. Found this for Googlebot:

https://developers.google.com/static/search/apis/ipranges/go...

simondotau 1412 days ago

Like the OP, I’ve employed a custom configuration in Cloudflare which detects (and blocks) browsers which claim to be Googlebot but don’t originate from Google’s approved Googlebot IP ranges.

The vast majority of such requests are dodgy scanning operations likely looking for email addresses or exploitable forms.
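The rule described above (flag requests whose UA claims Googlebot but whose source IP is outside the published ranges) can be sketched in a few lines. This is not the Cloudflare configuration itself, just an illustration of the same logic. It assumes the published range file is JSON with a `prefixes` list whose entries carry an `ipv4Prefix` or `ipv6Prefix` key; the function names are hypothetical, and the prefixes in the usage example are documentation placeholders, not real Google ranges.

```python
import ipaddress


def parse_prefixes(doc):
    """Turn a parsed range document into a list of ip_network objects.

    Assumed shape: {"prefixes": [{"ipv4Prefix": "..."}, {"ipv6Prefix": "..."}]}
    """
    networks = []
    for entry in doc.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def is_spoofed_googlebot(user_agent, remote_ip, networks):
    """True if the UA claims Googlebot but the IP is in none of the networks."""
    if "googlebot" not in (user_agent or "").lower():
        return False  # makes no Googlebot claim, so nothing to verify
    addr = ipaddress.ip_address(remote_ip)
    # Membership tests between IPv4 and IPv6 objects simply return False.
    return not any(addr in net for net in networks)


# Placeholder ranges (RFC 5737 / RFC 3849 documentation prefixes).
example_doc = {"prefixes": [{"ipv4Prefix": "192.0.2.0/24"},
                            {"ipv6Prefix": "2001:db8::/32"}]}
allowed = parse_prefixes(example_doc)
```

With the placeholder ranges, `is_spoofed_googlebot("Googlebot/2.1", "198.51.100.7", allowed)` is `True`, while the same UA from `192.0.2.10` is not flagged.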

TrickyRick 1412 days ago

What are some of the most interesting differences you've seen?

hutrdvnj 1411 days ago

I think the best thing is that for some sites that have many ads, subscriber-only content, cookie-consent pop-ups, and/or captchas, you will sometimes find that these are all gone and you get an ad-free, full-text version without pop-ups or captchas. But that's only the case for websites that do not check the origin IP and rely on the user agent alone.

userbinator 1412 days ago

It is indeed interesting. Some sites even let you view their content without JS or registering/subscribing, and/or revert to something approaching an unstyled static site without any ads, sidebars, or other useless content.

To add to some of the other experiences here about no-UA: I've tried that before too, and it was notably worse than pretending to be Google; lots of sites just return "Internal Server Error" or similar messages.

blitzar 1412 days ago

Do websites not spit at you, or do they just assume you 'will do no evil'?
