by easy_rider 4582 days ago
You can implement some strict enforcement in Apache using some crafty mod_rewrite rules: http://andthatsjazz.org/defeat.html

User-agent is too easily spoofed, but we could check whether the robots really are Google (whitelisted) and not some other crawler that just wants to scrape your content.
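One way to do that check, as a rough sketch: Google documents a reverse DNS lookup followed by a forward lookup for verifying Googlebot. The function name and the injectable lookup parameters below are my own, added so the logic can be tried without network access, and gethostbyname_ex only covers IPv4.

```python
import socket

# Hostname suffixes Google documents for its crawlers.
ALLOWED_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Forward-confirmed reverse DNS check for a client IP (IPv4 only).

    reverse_lookup and forward_lookup are hypothetical hooks for testing;
    by default the real resolver is used.
    """
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        # Step 1: the PTR record must point into a Google-owned domain.
        host = reverse_lookup(ip).lower().rstrip(".")
        if not host.endswith(ALLOWED_SUFFIXES):
            return False
        # Step 2: that hostname must resolve back to the same IP,
        # otherwise anyone controlling their own PTR record could lie.
        return ip in forward_lookup(host)
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return False
```

The second step matters because whoever owns an IP block controls its PTR record, so the reverse lookup alone proves nothing.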

In the realm of mail servers we have something called SPF: http://en.wikipedia.org/wiki/Sender_Policy_Framework

Just thinking outside the box here, but other than checking IP ranges: maybe the crawler could send a hash as a header in the GET request to verify that it is who it says it is.
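To make the header idea a bit more concrete, here is a minimal sketch. It assumes the crawler and the site share a secret ahead of time and uses a keyed hash (HMAC), since a plain hash could simply be copied by anyone who sees it once. The function names, the signed fields, and the idea of carrying the result in some custom header are all hypothetical; no crawler that I know of does this.

```python
import hashlib
import hmac


def sign_request(secret: bytes, method: str, path: str, timestamp: str) -> str:
    """Crawler side: compute the value to send in a (hypothetical) header."""
    # Signing the timestamp lets the server reject old, replayed requests.
    message = f"{method}\n{path}\n{timestamp}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_request(secret: bytes, method: str, path: str,
                   timestamp: str, signature: str) -> bool:
    """Server side: recompute the signature and compare in constant time."""
    expected = sign_request(secret, method, path, timestamp)
    return hmac.compare_digest(expected, signature)


# Example: the crawler signs a GET, the server checks it.
secret = b"shared-secret-for-illustration-only"
sig = sign_request(secret, "GET", "/index.html", "1700000000")
ok = verify_request(secret, "GET", "/index.html", "1700000000", sig)
```

A shared secret per site does not scale well for a crawler that visits millions of hosts, so a real scheme would more likely use a public key the crawler publishes, which is closer to how SPF leans on DNS.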