by hansvm 2017 days ago
(2) is a great idea I hadn't considered. A surprising number of sites require "browser" user-agents but otherwise have well-defined rate limits, robots.txt files, and everything you'd need to write a respectful crawler.

I'm not sure that (4) matters for larger sites? Their rate limits are usually a drop in the bucket compared to the background traffic.
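
For anyone wondering what honoring robots.txt looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `example-crawler` agent name, and the helper names are all made up for illustration; a real crawler would fetch the file from the site and would probably layer its own rate limiting on top.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; a real crawler would download this
# from https://<site>/robots.txt instead of hard-coding it.
SAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

AGENT = "example-crawler"  # hypothetical user-agent token


def load_policy(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text into a policy object without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def delay_for(parser: RobotFileParser, user_agent: str, default: float = 1.0) -> float:
    """Return the site's Crawl-delay in seconds, or a default if none is given."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


policy = load_policy(SAMPLE_ROBOTS)
```

Before each request you would call `policy.can_fetch(AGENT, url)` and sleep for `delay_for(policy, AGENT)` seconds between requests. The 1-second fallback is an arbitrary choice, not something any site promises is acceptable.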

1 comment

#4 was more to avoid being noticed by someone like me before they’ve had their morning coffee. That being said, if anything does go wrong, and you’ve ramped up slowly, at least it gives autoscaling time to respond.
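
A rough sketch of what ramping up slowly could look like, assuming you just want a list of request rates to step through. The doubling factor, the function name, and the idea of holding each rate for a fixed interval are my own assumptions, not anything a particular site requires.

```python
def ramp_schedule(start_rps: float, target_rps: float, factor: float = 2.0) -> list[float]:
    """Return the request rates (requests per second) to step through,
    growing geometrically from start_rps and ending exactly at target_rps."""
    if start_rps <= 0 or factor <= 1:
        raise ValueError("start_rps must be positive and factor must exceed 1")
    rates = []
    rate = start_rps
    while rate < target_rps:
        rates.append(rate)
        rate *= factor
    rates.append(target_rps)  # never overshoot the target
    return rates
```

For example, `ramp_schedule(0.5, 4.0)` steps through 0.5, 1, 2, and 4 requests per second. Holding each step for several minutes is what would give autoscaling (and whoever is on call) time to react.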

Generally, though, unless you screw up badly, submit forms, or blend in with a more problematic crawler, nobody’s going to care (or even notice).