by thaumaturgy
4170 days ago
URLs are really, really, really, really hard to get right on a large scale. For a side project I've written my own crawler/indexer, and I try to deduplicate where possible. The reality is that domain.com/this-page-here can serve entirely different content from domain.com/this-page-here/ depending on the server (and application) configuration.
Pretty much the only way to deduplicate URLs with 100% reliability is to look at their content, and somehow magically compare content that can change from page load to page load, which is a whole other problem.
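As a rough illustration of the content-based approach (a sketch under assumptions, not the crawler's actual code): group URLs by a hash of their fetched body. The `pages` mapping and both function names are hypothetical, and the bodies are hard-coded stand-ins for real responses.

```python
import hashlib
from collections import defaultdict


def content_fingerprint(body: str) -> str:
    """Hash a page body after collapsing runs of whitespace."""
    normalized = " ".join(body.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_by_content(pages: dict[str, str]) -> list[list[str]]:
    """Group URLs whose bodies hash to the same fingerprint."""
    groups = defaultdict(list)
    for url, body in pages.items():
        groups[content_fingerprint(body)].append(url)
    return list(groups.values())


# Hypothetical crawl results: URL -> fetched body.
pages = {
    "domain.com/this-page-here": "<h1>Article</h1>",
    "domain.com/this-page-here/": "<h1>Directory listing</h1>",
    "domain.com/this-page-here?ref=home": "<h1>Article</h1>",
}
groups = group_by_content(pages)
```

Here the trailing-slash URL stays separate because its body differs, while the `?ref=home` variant is merged with the first URL. An exact hash like this breaks as soon as a single byte changes between page loads, which is the "whole other problem" part.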
|
Another example is whether foo.com/bar is the same as foo.com/BAR. Usually yes, but it's entirely possible that they will serve different content.
Also, which URL parameters should be disregarded, and which should be considered important? A crawler must do quite a bit of nontrivial page introspection in order to figure out the answer to that all on its own.
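To make the case and parameter questions concrete, here is a conservative normalization sketch using Python's standard `urllib.parse`. It lowercases only the scheme and host, leaves the path's case and trailing slash alone, and drops a hypothetical set of tracking parameters. `IGNORED_PARAMS`, the decision to sort the remaining parameters, and the function name are all assumptions; a real crawler would have to work these out per site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these parameters never change the served content.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign"}


def normalize_url(url: str) -> str:
    """Apply only the normalizations that are safe on most servers."""
    parts = urlsplit(url)
    # Assumption: parameter order does not matter, so sort for a stable key.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    )
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),  # host is case-insensitive; assumes no userinfo
        parts.path,            # path case and trailing slash are preserved
        urlencode(query),
        "",                    # fragments are never sent to the server
    ))
```

With these rules, `HTTP://Foo.com/BAR?utm_source=x&b=2&a=1` becomes `http://foo.com/BAR?a=1&b=2`, while `foo.com/bar` and `foo.com/BAR` remain distinct keys until their content shows they are the same.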
Often, pages that are essentially the same will differ slightly. Timestamps and time-sensitive data (e.g. listings on a marketplace) will trip you up here.
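One common way to tolerate small differences like these is to compare overlapping word windows ("shingles") with Jaccard similarity instead of requiring an exact match. This is a minimal sketch, assuming plain extracted text and a window of three words; the function names and sample strings are made up for illustration, and any cutoff for "same page" would need tuning.

```python
def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-word windows in the text."""
    words = text.lower().split()
    if len(words) < k:
        # Too short for a full window: treat the whole text as one shingle.
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


# Two loads of the same hypothetical listing, differing only in the timestamp.
first = "Red bicycle for sale in good condition posted 3 minutes ago"
second = "Red bicycle for sale in good condition posted 9 minutes ago"
similarity = jaccard(first, second)
```

On text this short, a single changed word drops the score to 0.5, because it touches three of the nine shingles on each side. On a full page the same change barely moves the score, which is why the threshold has to be chosen against real data.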