by obelisk_ 3772 days ago
1. Google's web crawlers are not "bypassing" the paywall. It's the paywall that lets crawlers through, i.e. exactly the reverse of what the author implies with their headline.

2. The idea that this is somehow new is wrong. The way for a server to identify crawlers has "always" been to look at the user-agent and, when done right, the IP address, verified either by net block owner or by doing a PTR lookup and then checking that the A or AAAA record for the claimed host points back at the same IPv4 or IPv6 address. I do agree that paywalling is a more recent phenomenon, at least with regard to how popular it is among sites today. But the concept of presenting different data to crawlers and to visitors arose much earlier. Google has long been aware of it and has made sure to delist such sites when found. In fact, Google has since moved a bit in the direction of allowing it: they permit it for Google News if declared, as explained by others ITT.
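For anyone who hasn't seen the PTR-then-forward check written out, here is a rough Python sketch of it. It is only an illustration, not what any particular paywall runs. The function names, the "Googlebot" user-agent substring test, and the list of accepted hostname suffixes are my own assumptions, and the resolvers are passed in as parameters so the logic can be exercised without touching DNS.

```python
import ipaddress
import socket

# Assumption: hostname suffixes accepted as belonging to the crawler.
# Check the crawler operator's own documentation for the real list.
CRAWLER_SUFFIXES = (".googlebot.com", ".google.com")


def reverse_lookup(ip):
    """PTR lookup: return the hostname the IP claims, or None on failure."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def forward_lookup(host):
    """A/AAAA lookup: return the set of addresses the hostname resolves to."""
    try:
        infos = socket.getaddrinfo(host, None)
    except OSError:
        return set()
    return {info[4][0] for info in infos}


def is_verified_crawler(ip, user_agent,
                        reverse=reverse_lookup, forward=forward_lookup):
    """Forward-confirmed reverse DNS check for a claimed crawler request."""
    # Step 1: the user-agent must claim to be the crawler (trivially spoofable).
    if "Googlebot" not in user_agent:
        return False
    # Step 2: the PTR record must name a host under an accepted domain.
    host = reverse(ip)
    if host is None:
        return False
    if not host.lower().rstrip(".").endswith(CRAWLER_SUFFIXES):
        return False
    # Step 3: the claimed host must resolve back to the same address.
    target = ipaddress.ip_address(ip)
    return any(ipaddress.ip_address(addr) == target for addr in forward(host))
```

Step 3 is the part that matters: anyone who controls reverse DNS for their own address range can publish a PTR record naming any host they like, but they cannot make that host's A or AAAA record point back at their address.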

So in my view, it seems that the author is jumping to incorrect conclusions based on an incomplete understanding of what's actually going on here. What then about the HN readership, how come this article became so highly voted and I don't see these issues raised by anyone else? Or maybe I'm just crazy?

1 comments

> Google's web crawlers are not "bypassing" the paywall. It's the paywall that lets crawlers through, i.e. exactly the reverse of what the author implies with their headline.

Don't nitpick. It's just a shortened version of "How To 'Be' a Google Web Crawler to Bypass Paywalls". You get it. I get it. Everyone gets it.