by alexdoma 3112 days ago
What would be a better solution? An IP address check to allow only known Google crawlers, perhaps?
3 comments

Classify IPs based on their recent behavior[2]. Most bots behave very differently from the median user, along many different dimensions -- volume of requests, time between requests, visit length, which links are followed, etc.

And if this means that bots are altered to become indistinguishable from users, and therefore have a minimal impact on a site's load? Well, mission accomplished[1].

[1] https://xkcd.com/810/

ETA: [2] Recent behavior (as opposed to all historical behavior) is used so that someone inheriting a "bad" IP isn't completely screwed over.
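
A rough sketch of what I mean, in case it helps. It only looks at two of the dimensions above (request volume and time between requests), and every name and threshold here (RecentBehavior, WINDOW_SECONDS, MAX_REQUESTS, MIN_MEDIAN_GAP) is made up for illustration, not taken from any real system. The sliding window is what makes it "recent" behavior: old requests age out, so a reassigned IP starts clean.

```python
from collections import defaultdict, deque
from statistics import median

# All three thresholds are illustrative assumptions; tune against real traffic.
WINDOW_SECONDS = 600    # only the last 10 minutes of requests are considered
MAX_REQUESTS = 120      # more requests than this inside the window looks bot-like
MIN_MEDIAN_GAP = 0.5    # median seconds between requests below this looks bot-like


class RecentBehavior:
    """Tracks per-IP request timestamps inside a sliding time window."""

    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self.hits = defaultdict(deque)  # ip -> request timestamps, oldest first

    def _expire(self, timestamps, now):
        # Drop requests older than the window so old behavior is forgotten.
        while timestamps and now - timestamps[0] > self.window:
            timestamps.popleft()

    def record(self, ip, now):
        timestamps = self.hits[ip]
        timestamps.append(now)
        self._expire(timestamps, now)

    def looks_like_bot(self, ip, now):
        timestamps = self.hits[ip]
        self._expire(timestamps, now)
        if len(timestamps) > MAX_REQUESTS:
            return True   # too many requests in the recent window
        if len(timestamps) < 3:
            return False  # not enough recent data to judge
        ordered = list(timestamps)
        gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
        return median(gaps) < MIN_MEDIAN_GAP  # requests arrive too quickly
```

A real classifier would want more signals (visit length, which links are followed) and probably something smarter than fixed thresholds, but the shape is the same: judge the behavior inside the window, not the identity of the IP.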

That's a superb xkcd that I hadn't seen yet, thanks.
The real solution is disallowing behaviors, rather than checking for shibboleths.

It's surprising that malicious bots aren't exploiting those things already.

That practically invites them to present a different page to Google than to a normal user, the former pure SEO, the latter perhaps pure advertising.
And Google will happily deindex the site as soon as they find out.
Raising an interesting question: can a website owner (use the law to) ban Google from accessing their website by any mechanism other than their crawler, so that Google doesn’t find out?

Sure, it's obviously of limited utility, just like the flaw in “the right to be forgotten”: you can diff the US internet against the EU internet to find specifically what people want forgotten. But shenanigans interest me.