HN Mirror



	by xurukefi 123 days ago
	There are ways to work around this. I just tested one: I used the URL Inspection tool in Google Search Console to fetch a URL on my own website, which I had configured to redirect to a paywalled news article. It turns out the crawler follows the redirect and returns the full source of the target page, without any paywall. That's maybe a bit insane to automate at the scale of archive.today, but I figure they do something along these lines. It's a perfect imitation of Googlebot because it is literally Googlebot.

2 comments

jsheard 123 days ago

I'd file that under "doesn't know what they're doing" because the search console uses a totally different user-agent (Google-InspectionTool) and the site is blindly treating it the same as Googlebot :P

Presumably they are just matching on *Google* and calling it a day.
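To make the guess above concrete, here is a minimal sketch of what "matching on *Google*" might look like next to a slightly stricter token check. The function names and the exact User-Agent strings are assumptions for illustration, not anything confirmed about the site in question.

```python
import re

# Illustrative User-Agent strings (assumed for this sketch; consult
# Google's crawler documentation for the current values).
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
INSPECTION_UA = "Mozilla/5.0 (compatible; Google-InspectionTool/1.0;)"


def naive_is_googlebot(user_agent: str) -> bool:
    """Hypothetical sloppy check: any UA containing 'google' passes."""
    return "google" in user_agent.lower()


# Match only the Googlebot product token followed by a version number.
_GOOGLEBOT_TOKEN = re.compile(r"\bGooglebot/\d")


def stricter_is_googlebot(user_agent: str) -> bool:
    """Hypothetical stricter check: require the 'Googlebot/<version>' token."""
    return _GOOGLEBOT_TOKEN.search(user_agent) is not None


if __name__ == "__main__":
    for ua in (GOOGLEBOT_UA, INSPECTION_UA):
        print(naive_is_googlebot(ua), stricter_is_googlebot(ua), ua)
```

Even the stricter version only inspects a header that any client can forge, so it separates the two Google tools from each other but proves nothing about where a request came from. Google documents a reverse DNS lookup with forward confirmation for that.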


xurukefi 123 days ago

Sure, but maybe there are other ways to control Googlebot in a similar fashion. Maybe even with a pristine looking User-Agent header.


Aurornis 123 days ago

> which I've configured to redirect to a paywalled news article.

Which specific site with a paywall?
