Hacker News
by jfoster 4170 days ago
Exactly. It's so difficult to get URLs "right", and that's quite non-obvious until you do something like writing a crawler.

Another example is whether foo.com/bar is the same as foo.com/BAR. Usually yes, but URL paths are case-sensitive, so it's entirely possible that a server will serve different content for each.
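
To make the case point concrete, here is a rough sketch of a conservative normalizer, assuming absolute URLs with a scheme. It lowercases only the scheme and host, which are case-insensitive, and leaves the path alone. The function name `normalize_case` is made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_case(url: str) -> str:
    """Lowercase the scheme and host only; the path keeps its case."""
    parts = urlsplit(url)
    # Userinfo (user:password@) is case-sensitive, so lowercase only host:port.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()
    return urlunsplit(
        (parts.scheme.lower(), netloc, parts.path, parts.query, parts.fragment)
    )

print(normalize_case("HTTP://Foo.COM/BAR"))
```

With this, http://foo.com/bar and http://foo.com/BAR stay distinct, and a crawler would still have to compare the fetched content to decide whether they are really the same page.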

Also, which URL parameters should be disregarded, and which should be considered important? A crawler must do quite a bit of nontrivial page introspection in order to figure out the answer to that all on its own.
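
As a sketch of the easy half of that problem: if you already have a list of parameters to ignore, stripping them is simple. The list below (`IGNORED_PARAMS`) and the function name are assumptions for illustration only. Working out what belongs on that list for an arbitrary site is the hard part, and this does nothing to solve it.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical ignore list; a real crawler would have to learn this per site.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}

def canonical_query(url: str, ignored=IGNORED_PARAMS) -> str:
    """Drop ignored parameters and sort the rest so order does not matter."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    ]
    kept.sort()
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )

print(canonical_query("http://foo.com/item?utm_source=hn&id=7"))
```

Sorting assumes parameter order is irrelevant to the server, which is usually true but not guaranteed.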

Often, pages that are essentially the same will differ slightly. Timestamps and time-sensitive data (e.g. listings on a marketplace) will trip you up here.
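
One common way around this is to compare pages approximately rather than byte for byte. Below is a minimal sketch, assuming you have already extracted the visible text: it masks digit runs (so timestamps and counters drop out) and compares sets of three-word shingles by Jaccard similarity. The function names and the choice of three words are arbitrary, and any cutoff for "same page" would need tuning on real data.

```python
import re

def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles, with every digit run masked as '0'."""
    words = re.sub(r"\d+", "0", text.lower()).split()
    # Texts shorter than k words yield a single (shorter) shingle.
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

page_a = "Widget for sale, updated 2013-05-01 10:32"
page_b = "Widget for sale, updated 2013-05-02 11:47"
print(similarity(page_a, page_b))
```

Masking digits is crude: it also hides real differences such as prices or item IDs, so on a marketplace this would treat some genuinely different listings as identical.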