by lazyant 2017 days ago
The web (pages linked by URLs) is not a DAG, so it can contain loops. I've never designed a web crawler, but I'd think a basic feature would be deduplication: a database table of visited URLs, each with a timestamp, that the crawler checks before visiting a URL. The timestamp lets you revisit after X days to check for changes, and that refresh rate can also be stored in the table per URL.
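A rough sketch of the visited-URL table described above, using SQLite from the Python standard library. The table name `visited`, the column names, the helper functions, and the one-week default refresh are all made up for illustration; the comment does not specify any of them.

```python
import sqlite3
import time

# Hypothetical default: revisit a URL after one week.
DEFAULT_REFRESH_SECONDS = 7 * 24 * 3600


def open_store(path=":memory:"):
    """Open (or create) the table of visited URLs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS visited ("
        " url TEXT PRIMARY KEY,"
        " last_visit REAL NOT NULL,"
        " refresh_seconds REAL NOT NULL)"
    )
    return conn


def should_visit(conn, url, now=None):
    """True if the URL was never visited or its refresh interval has elapsed."""
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT last_visit, refresh_seconds FROM visited WHERE url = ?",
        (url,),
    ).fetchone()
    if row is None:
        return True
    last_visit, refresh_seconds = row
    return now - last_visit >= refresh_seconds


def mark_visited(conn, url, refresh_seconds=DEFAULT_REFRESH_SECONDS, now=None):
    """Record a visit, overwriting any earlier row for the same URL."""
    now = time.time() if now is None else now
    conn.execute(
        "INSERT OR REPLACE INTO visited (url, last_visit, refresh_seconds)"
        " VALUES (?, ?, ?)",
        (url, now, refresh_seconds),
    )
    conn.commit()
```

The crawler would call `should_visit` before each fetch and `mark_visited` after it. Because a URL already in the table is skipped until its interval runs out, link loops cannot trap the crawler.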
1 comment

Trust me, that's not the first thing you think about when designing your scraper.

Typically, one doesn't care whether the same page has been visited before. What one does care about is avoiding storing duplicate data.
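One way to avoid storing duplicate data, as opposed to tracking visited URLs, is to key stored pages by a hash of their content. This is only a sketch under that assumption: the `ContentStore` class is hypothetical, and it catches byte-identical bodies only, not near-duplicates.

```python
import hashlib


class ContentStore:
    """Keeps one copy of each distinct page body, keyed by its SHA-256 digest."""

    def __init__(self):
        self._by_digest = {}

    def add(self, url, body: bytes) -> bool:
        """Store the body unless identical content is already present.

        Returns True if the body was new and stored, False if it was a duplicate.
        """
        digest = hashlib.sha256(body).hexdigest()
        if digest in self._by_digest:
            return False
        self._by_digest[digest] = (url, body)
        return True

    def __len__(self):
        return len(self._by_digest)
```

With this approach the same page reached through two different URLs is fetched twice but stored once.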

> a basic feature would be deduplication