| HN Mirror



	by janejeon 958 days ago
	Really dumb question: how do services like this work? As in, how do they bypass these paywalls? The obvious thing is to impersonate Googlebot, but site owners can check that the request isn't coming from a Google-published IP and see that it's a fake, right?
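
The server-side check being asked about can be sketched roughly as below. This is only an illustration under stated assumptions, not how any particular site does it: `SAMPLE_CRAWLER_RANGES`, `ip_in_ranges`, and `looks_like_spoofed_crawler` are hypothetical names, and the single CIDR range is a sample value, not an authoritative copy of Google's published list. The important detail is that `remote_ip` must be the address of the actual TCP peer, not something read from a request header.

```python
import ipaddress

# Sample CIDR range for illustration only (an assumption); a real check would
# load the crawler operator's currently published list instead.
SAMPLE_CRAWLER_RANGES = ["66.249.64.0/19"]


def ip_in_ranges(remote_ip, cidr_ranges):
    """Return True if remote_ip falls inside any of the given CIDR ranges."""
    addr = ipaddress.ip_address(remote_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidr_ranges)


def looks_like_spoofed_crawler(user_agent, remote_ip,
                               cidr_ranges=SAMPLE_CRAWLER_RANGES):
    """Flag requests whose User-Agent claims Googlebot but whose peer IP
    is outside the allowed ranges."""
    claims_googlebot = "googlebot" in user_agent.lower()
    return claims_googlebot and not ip_in_ranges(remote_ip, cidr_ranges)
```

A request with a Googlebot user agent arriving from, say, a documentation-range address such as 203.0.113.7 would be flagged, while the same user agent from an address inside the listed range would pass.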

3 comments

Fnoord 958 days ago

Some possible clues:

> https://github.com/kubero-dev/ladder#environment-variables

> USER_AGENT | User agent to emulate | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

> X_FORWARDED_FOR | IP forwarder address | 66.249.66.1

> RULESET | URL to a ruleset file | https://raw.githubusercontent.com/kubero-dev/ladder/main/rul... or /path/to/my/rules.yaml
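
For illustration, here is roughly what an outgoing request carrying those two values might look like. This is a guess based only on the variable names quoted above, not on ladder's actual code; `build_request` is a hypothetical helper and the URL is a placeholder. The request is built but never sent.

```python
import urllib.request

# Values copied from the README excerpt quoted above.
USER_AGENT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
X_FORWARDED_FOR = "66.249.66.1"


def build_request(url):
    """Build (but do not send) a GET request carrying the two headers."""
    return urllib.request.Request(
        url,
        headers={
            "User-Agent": USER_AGENT,
            "X-Forwarded-For": X_FORWARDED_FOR,
        },
    )


req = build_request("https://example.com/article")  # placeholder URL
# urllib stores header names via str.capitalize(), hence the lowercase tails.
print(req.get_header("User-agent"))
print(req.get_header("X-forwarded-for"))
```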


janejeon 958 days ago

Oh wow... I'm surprised that's enough. When I was researching scraping protection bypass, you had to do some real crazy stuff with the browser instance + using residential IPs at a minimum...


2cpu1container 958 days ago

That's not the full story. It works on many sites, but some (ft.com, for example) have stronger countermeasures against paywall bypassing. For those, ladder modifies the HTML served by the origin to remove them.

Those rules still need to be built up (by me or the OS community).
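
As a rough idea of what "modifying the served HTML" can mean, the sketch below drops `<script>` elements whose opening tag mentions a given substring. It is not ladder's rule format or implementation: `strip_scripts` and the substring list are hypothetical, and regex-based HTML editing is a simplification that a real proxy would likely replace with a proper parser.

```python
import re

# Matches a whole <script ...>...</script> element (non-greedy, across lines).
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script>", re.IGNORECASE | re.DOTALL)


def strip_scripts(html, src_substrings):
    """Remove script elements whose opening tag contains any given substring."""

    def drop_if_matched(match):
        element = match.group(0)
        opening_tag = element.split(">", 1)[0]
        if any(s in opening_tag for s in src_substrings):
            return ""  # drop the whole element
        return element  # keep unrelated scripts untouched

    return SCRIPT_RE.sub(drop_if_matched, html)


page = (
    '<p>Story</p>'
    '<script src="/js/paywall.js"></script>'
    '<script src="/js/app.js"></script>'
)
print(strip_scripts(page, ["paywall"]))
```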


ComputerGuru 958 days ago

I don’t know of any off-the-shelf product that respects X_FORWARDED_FOR unless the current request IP originates from a whitelisted (or LAN) address.
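
The usual logic looks something like the sketch below: the forwarded address is only believed when the direct peer is itself a trusted proxy. The names `TRUSTED_PROXIES` and `effective_client_ip` and the listed networks are assumptions for illustration, not the behavior of any specific product.

```python
import ipaddress

# Hypothetical trusted proxies: loopback plus two private LAN ranges.
TRUSTED_PROXIES = ["127.0.0.0/8", "10.0.0.0/8", "192.168.0.0/16"]


def effective_client_ip(peer_ip, x_forwarded_for, trusted=TRUSTED_PROXIES):
    """Return the IP to treat as the client.

    peer_ip is the address of the TCP connection; x_forwarded_for is the raw
    header value (or None). The header is ignored unless the peer is trusted.
    """
    peer = ipaddress.ip_address(peer_ip)
    peer_is_trusted = any(peer in ipaddress.ip_network(net) for net in trusted)
    if not peer_is_trusted or not x_forwarded_for:
        return peer_ip
    # The rightmost entry is the one appended by the proxy nearest to us.
    return x_forwarded_for.split(",")[-1].strip()
```

Under this logic, a request arriving directly from the public internet with `X-Forwarded-For: 66.249.66.1` is still attributed to its real peer address.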


narinxas 958 days ago

> site owners can check that the request isn't coming from a Google-published IP and see that it's a fake, right?

Just because they can doesn't mean they will... Also, most "site owners" are (by this point) completely different people from the "site operators" (who I take to be the 'engineers' who can actually check these IP things).


calflegal 958 days ago

Related: if this is how they work, why doesn't Google offer a private service to allow publishers to have content indexed while still protected?


matsemann 958 days ago

It used to be against guidelines to serve different content to Google vs. what users would see. Not sure if that's still the case, but I don't think it's in Google's interest to give a result that the user actually can't access.


ComputerGuru 958 days ago

I’m not aware that this policy has changed. What has changed is that Google will rank results it can’t (officially) index without showing their content. I’m guessing they do shadow index them but rely on the same “if you outwardly can’t tell they did, then it’s as if they didn’t” reasoning (the as-if rule) that C++ compilers use to get away with insane optimizations.
