by lazyant
2017 days ago
The graph of web pages (URLs) is not a DAG, so it can contain cycles. Even though I've never designed a web crawler, I'd think deduplication would be a basic feature: a database table of visited URLs, each with a timestamp, which the crawler checks before visiting a URL. The timestamp lets you visit again after X days to check for changes, and that refresh interval could also be stored in the table per URL.
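For what it's worth, here is a rough sketch of that table using Python's built-in sqlite3 module. The table layout, the function names (`open_store`, `should_visit`, `mark_visited`) and the weekly default refresh are all my own assumptions for illustration, not how any particular crawler does it.

```python
import sqlite3
import time

# Assumed default: revisit a URL after one week.
DEFAULT_REFRESH_SECONDS = 7 * 24 * 3600


def open_store(path=":memory:"):
    """Open (or create) the hypothetical table of visited URLs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS visited ("
        "url TEXT PRIMARY KEY, "
        "last_visit REAL NOT NULL, "
        "refresh_seconds REAL NOT NULL)"
    )
    return conn


def should_visit(conn, url, now=None):
    """True if the URL was never visited or its refresh interval has elapsed."""
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT last_visit, refresh_seconds FROM visited WHERE url = ?",
        (url,),
    ).fetchone()
    if row is None:
        return True
    last_visit, refresh_seconds = row
    return now - last_visit >= refresh_seconds


def mark_visited(conn, url, now=None, refresh_seconds=DEFAULT_REFRESH_SECONDS):
    """Record a visit, overwriting any earlier row for the same URL."""
    now = time.time() if now is None else now
    conn.execute(
        "INSERT OR REPLACE INTO visited (url, last_visit, refresh_seconds) "
        "VALUES (?, ?, ?)",
        (url, now, refresh_seconds),
    )
    conn.commit()
```

The crawler would call `should_visit` before fetching and `mark_visited` afterwards. Because a URL that is already in the table and still fresh gets skipped, a cycle of links cannot make the crawl loop forever.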
Typically, one doesn't care whether the same page has been visited before. What one does care about is avoiding storing duplicate data.
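One possible way to do that, as a minimal sketch: key the stored data by a hash of the page body, so that two URLs serving identical bytes are stored once. The `ContentStore` class and its method names are hypothetical, and an exact SHA-256 match only catches byte-identical duplicates, not near-duplicates.

```python
import hashlib


class ContentStore:
    """Hypothetical in-memory store that keeps each distinct page body once."""

    def __init__(self):
        self._body_by_digest = {}   # digest -> page body
        self._digest_by_url = {}    # url -> digest of the body last seen there

    def store(self, url, body):
        """Record the body for a URL; return True only if the body was new."""
        digest = hashlib.sha256(body).hexdigest()
        self._digest_by_url[url] = digest
        if digest in self._body_by_digest:
            return False  # same bytes already stored, possibly under another URL
        self._body_by_digest[digest] = body
        return True

    def get(self, url):
        """Return the stored body for a URL, or None if the URL is unknown."""
        digest = self._digest_by_url.get(url)
        return None if digest is None else self._body_by_digest[digest]

    def distinct_bodies(self):
        """Number of distinct bodies actually stored."""
        return len(self._body_by_digest)
```

With this layout the same page can be fetched any number of times, or reached through several URLs, without the stored data growing.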