


	by alexdoma 3112 days ago
	What would be a better solution? An IP address check that allows only known Google crawlers, perhaps?
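	One way such a check is commonly done is a reverse DNS lookup on the connecting IP, followed by a forward lookup to confirm the hostname maps back to the same address. The sketch below assumes that approach; the hostname suffixes, the function name `is_google_crawler`, and the injectable `reverse_lookup` / `forward_lookup` parameters are illustrative assumptions, not something stated in this thread, and the default forward lookup covers IPv4 only.

```python
import socket

# Assumed hostname suffixes for genuine Google crawlers; verify against
# Google's current documentation before relying on them.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_google_crawler(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if `ip` reverse-resolves to a Google hostname
    that in turn resolves back to the same `ip`."""
    # The lookups are injectable so the logic can be tested offline.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip)
        # Step 1: the PTR record must sit under an expected domain.
        if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
            return False
        # Step 2: forward-confirm, since anyone can set a misleading PTR record.
        return ip in forward_lookup(host)
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return False
```

	The forward confirmation is the important part: a reverse record alone is controlled by whoever owns the IP block, so it proves nothing by itself.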

3 comments

andreareina 3112 days ago

Classify IPs based on their recent behavior[2]. Most bots behave very differently from the median user, along many different dimensions -- volume of requests, time between requests, visit length, which links are followed, etc.

And if this means that bots are altered to become indistinguishable from users, and therefore have a minimal impact on a site's loading? Well, mission accomplished[1].

[1] https://xkcd.com/810/

ETA: [2] Recent behavior (as opposed to all historical behavior) is used so that someone inheriting a "bad" IP isn't completely screwed over.
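A rough sketch of what classifying by recent behavior could look like, using only two of the dimensions mentioned above (request volume and time between requests). The class name, the window length, and both thresholds are made-up illustrative values, not anything measured; a real system would tune them and look at more signals.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 600   # hypothetical: only the last 10 minutes count
MAX_REQUESTS = 300     # hypothetical volume threshold within the window
MIN_MEAN_GAP = 0.5     # hypothetical minimum mean seconds between requests


class RecentBehavior:
    """Tracks request timestamps per IP inside a sliding time window."""

    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self.hits = defaultdict(deque)  # ip -> timestamps, oldest first

    def _expire(self, q, now):
        # Drop timestamps older than the window, so old behavior is forgotten.
        while q and now - q[0] > self.window:
            q.popleft()

    def record(self, ip, now=None):
        now = time.time() if now is None else now
        q = self.hits[ip]
        q.append(now)
        self._expire(q, now)

    def looks_like_bot(self, ip, now=None):
        now = time.time() if now is None else now
        q = self.hits[ip]
        self._expire(q, now)
        if len(q) < 2:
            return False  # not enough recent data to judge
        mean_gap = (q[-1] - q[0]) / (len(q) - 1)
        return len(q) > MAX_REQUESTS or mean_gap < MIN_MEAN_GAP
```

Because entries expire once they fall out of the window, an IP that misbehaved earlier is judged only on what it has done lately, which is the point of footnote [2].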


mark-r 3112 days ago

That's a superb xkcd that I hadn't seen yet, thanks.


marcosdumay 3112 days ago

The real solution is disallowing behaviors, instead of shibboleths.

It's surprising that malicious bots aren't exploiting those things already.


ben_w 3111 days ago

That practically invites them to present a different page to Google than to a normal user: the former pure SEO, the latter perhaps pure advertising.


fjsolwmv 3111 days ago

And Google will happily deindex the site as soon as they find out.


ben_w 3110 days ago

Raising an interesting question: can a website owner (use the law to) ban Google from accessing their website by any mechanism other than its crawler, so that Google doesn’t find out?

Sure, that is obviously of limited utility, much like the flaw in “the right to be forgotten”, where diffing the USA internet against the EU internet reveals specifically what people want forgotten. But shenanigans interest me.
