Hacker News new | ask | show | jobs
by janejeon 958 days ago
Really dummy question: how do services like this work? As in, how do they bypass these paywalls?

The obvious thing is to mock Googlebot, but site owners can check that the request isn't coming from a Google-published IP and see that it's a fake, right?

3 comments

Some possible clues:

> https://github.com/kubero-dev/ladder#environment-variables

> USER_AGENT User agent to emulate Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

> X_FORWARDED_FOR IP forwarder address 66.249.66.1

> RULESET URL to a ruleset file https://raw.githubusercontent.com/kubero-dev/ladder/main/rul... or /path/to/my/rules.yaml

Oh wow... I'm surprised that's enough. When I was researching scraping protection bypass, you had to do some real crazy stuff with the browser instance + using residential IPs at a minimum...
Thats not the full story. It works on many sites, but some (ft.com as an example) have more severe countermeasures to bypass the paywall. Therefore the ladders modifies the served HTML from origin to remove such.

Those rules still need to be build up. (by me or the OS-community)

I don’t know of any off-the-shelf product that respects X_FORWARDED_FOR unless the current request ip originates from a whitelisted (or lan) address.
> site owners can check that the request isn't coming from a Google-published IP and see that it's a fake, right?

just because they can doesn't mean they will... also most "site owners" are (by this point) a completely different people than "site operators" (who I take to be the 'engineers' who indeed can check this IP things)

related: If this is how they work, why doesn't google offer a private service to allow publishers to have content indexed while still protected?
It used to be against guidelines to serve different content to google vs what users would see. Not sure if still the case, but I don't think it's in google's interest to give a result that the user actually can't access.
I’m not aware that this policy has changed. What has changed is that Google will rank results it can’t (officially) index without showing their content. I’m guessing they do shadow index them but use the whole “if you outwardly can’t tell they did then it’s as if they didn’t” C++ compilers use to get away with insane optimizations.