Funny, I saw this HN headline just after banning another scraper's IP range.

You're welcome to scrape my sites, but please do it ethically. I don't know exactly how to define that, but here are some examples of things I consider not cool:

- Scraping without a contact method, or at least some unique identifier (like your project's codename), in the user agent string. Including one is common practice, see e.g. <https://en.wikipedia.org/wiki/User-Agent_header#Format_for_a...>. Many sites ask in their public API guidelines that you include an email address so you can be contacted in case of problems. If you don't include one and you're causing trouble, all I can do is ban your IP address altogether (or entire ranges: if you hop between several IPs, I'll have to assume you have access to the whole range). Nobody likes IP bans: you have to get a new IP, your provider has a burned IP address, the next customer runs into issues... Don't be this person, include an identifier.

- Timing out the request after a few seconds. Some pages on my site involve number crunching and take 20 seconds to load. I could add complexity to do this async instead, but by computing it live, regular users get the latest info, they know to just wait a few seconds, and everybody is happy. Even the scrapers can get the info; I'm fine computing those pages for you. But if you ask me to do work and then walk away, that's just rude. It shows up in my logs as HTTP status 499, and I'll ban scrapers that I notice doing this regularly.

- Ignoring robots.txt. I have exactly one entry in there, and that's a caching proxy for another site that is struggling with load. If you ignore the robots file and just crawl the thing from A to Z at a high rate, that causes a lot of requests to the upstream site to refresh stale caches. You can obviously expect a ban, because it's again just a waste of resources.
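For anyone wondering what that could look like in practice, here is a minimal sketch using only Python's standard library. It covers the three points above: an identifying user agent, a timeout long enough for slow pages, and a robots.txt check before every fetch. The bot name, contact address, URLs, timeout, and delay are all made-up placeholders, not anything my sites require, and the helper names (`load_robots`, `build_request`, `polite_fetch`) are just for illustration.

```python
import time
import urllib.request
import urllib.robotparser

# Hypothetical identity: replace with your own project name and a real contact.
USER_AGENT = "examplebot/0.1 (https://example.org/bot; bot-admin@example.org)"
TIMEOUT_SECONDS = 60  # assumption: comfortably above a page that takes ~20 s
DELAY_SECONDS = 5     # assumption: arbitrary pause between requests


def load_robots(robots_txt: str) -> urllib.robotparser.RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def build_request(url: str) -> urllib.request.Request:
    """Attach the identifying User-Agent header to the request."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def polite_fetch(url: str, robots: urllib.robotparser.RobotFileParser):
    """Fetch url if robots.txt allows it; return None when it is disallowed."""
    if not robots.can_fetch(USER_AGENT, url):
        return None
    # Wait for the full response instead of walking away after a few seconds.
    with urllib.request.urlopen(build_request(url), timeout=TIMEOUT_SECONDS) as resp:
        body = resp.read()
    time.sleep(DELAY_SECONDS)  # crude rate limit between requests
    return body
```

Usage would be roughly: download `/robots.txt` once, pass its text to `load_robots`, then call `polite_fetch` for each page. A fixed sleep is the simplest possible rate limit; a real crawler would probably want something smarter, but even this avoids the A-to-Z hammering described above.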