by tlack 3353 days ago
What did you do to avoid winding up in endless GET URL loops? How deep did you get per site, and how did you schedule follow-up requests?

Loop/spam prevention was handled by mixnode; I'm not sure how they do it.

The crawl does not follow a DFS or BFS pattern, so the number of pages per site varies greatly with each host's server capacity and anti-crawling configuration.

There was a minimum of 10 seconds between follow-up requests to the same website, unless robots.txt specified a lower delay. Pretty polite...
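
For anyone wondering what that kind of per-host delay could look like, here is a rough Python sketch. It is not mixnode's or the crawler's actual code; the class and method names (HostScheduler, set_crawl_delay, wait_time, record_request) are made up for illustration, and it assumes the robots.txt Crawl-delay has already been parsed elsewhere. Following the description above, a robots.txt delay only replaces the 10-second default when it is lower.

```python
from urllib.parse import urlsplit

DEFAULT_DELAY = 10.0  # seconds; the minimum gap described above


class HostScheduler:
    """Hypothetical helper: tracks when each host may be requested again."""

    def __init__(self, default_delay=DEFAULT_DELAY):
        self.default_delay = default_delay
        self.crawl_delay = {}   # host -> Crawl-delay from robots.txt, if known
        self.next_allowed = {}  # host -> earliest time of the next request

    def set_crawl_delay(self, host, seconds):
        # Assumes the caller has already parsed robots.txt for this host.
        self.crawl_delay[host] = seconds

    def delay_for(self, host):
        # Use the robots.txt delay only when it is lower than the default.
        robots = self.crawl_delay.get(host)
        if robots is not None and robots < self.default_delay:
            return robots
        return self.default_delay

    def wait_time(self, url, now):
        # Seconds to wait before this URL's host may be requested again.
        host = urlsplit(url).hostname
        return max(0.0, self.next_allowed.get(host, 0.0) - now)

    def record_request(self, url, now):
        # Call right after sending a request to push back the host's next slot.
        host = urlsplit(url).hostname
        self.next_allowed[host] = now + self.delay_for(host)
```

In a real crawler you would pass time.monotonic() as now and sleep (or requeue the URL) for wait_time seconds before fetching. Taking the clock as a parameter just keeps the sketch easy to test.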