


	by Jake232 3933 days ago
	I have scrapers built in Python that do well over a million pages per day, but that's not really a benchmark you can use. It all depends on the amount of computation required to extract the page data, among other things. You should be able to achieve > 120k pages per day for sure though. That's less than two pages per second.
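
	For reference, the per-second figures behind those daily numbers can be checked with a few lines of Python. This is only a back-of-the-envelope sketch; the helper name `pages_per_second` is made up for illustration and assumes pages are spread evenly over the day.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def pages_per_second(pages_per_day):
    """Average rate, assuming pages are spread evenly over 24 hours."""
    return pages_per_day / SECONDS_PER_DAY


print(f"{pages_per_second(120_000):.2f}")    # 1.39
print(f"{pages_per_second(1_000_000):.2f}")  # 11.57
```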

1 comment

Thank you. Well, I'm doing several things:

1) I check whether or not the page we just scraped has any of the tags we are looking for.

2) We then extract any information within those tags (images, etc.)

3) We follow through every link and, if it's not in the seen/scraped list, we add it to the queue.
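
Roughly, the three steps above could look like the sketch below. It is not my actual code, just a minimal standard-library version under some assumptions: `TARGET_TAGS`, `PageParser`, `crawl` and the `fetch` callable are all made-up names, `fetch(url)` is assumed to return the page HTML as a string, and "extracting information" is reduced to keeping each matching tag's attributes.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

# Assumption: the "tags we are looking for" are just images here.
TARGET_TAGS = {"img"}


class PageParser(HTMLParser):
    """Collects target tags and outgoing links from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.matches = []  # (tag, attributes) for each target tag found
        self.links = []    # absolute URLs taken from <a href="...">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in TARGET_TAGS:
            self.matches.append((tag, attrs))
        if tag == "a" and attrs.get("href"):
            # Resolve relative links and drop "#fragment" so that the
            # same page is not queued twice under different URLs.
            absolute, _fragment = urldefrag(urljoin(self.base_url, attrs["href"]))
            self.links.append(absolute)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; returns {url: matches} for pages with target tags."""
    seen = {start_url}          # the seen/scraped list
    queue = deque([start_url])  # pages still to visit
    results = {}
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        parser = PageParser(url)
        parser.feed(fetch(url))
        fetched += 1
        if parser.matches:                 # step 1: any target tags?
            results[url] = parser.matches  # step 2: keep what they contain
        for link in parser.links:          # step 3: queue unseen links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return results
```

Keeping `seen` as a set makes the "have we seen this link" check constant time, which starts to matter once the list grows into the millions. The per-page work is one parse pass, so most of the time would likely go to fetching rather than extraction.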

Not sure if this helps to narrow it down.

Thanks!