Hacker News
by ot 4693 days ago
Technically speaking, there are no equivalent URLs in general: different strings may lead to different resources.

Still, there are a number of common-sense heuristics for normalizing URLs, which HN applies for de-duplication. I was wondering what the rationale is for not including trailing-slash removal among them. I mean, is there any legitimate website that serves a different resource if you remove the trailing slash?

3 comments

Per RFC 3986/3987, http://example.com/%61 and http://example.com/a are equivalent. (Indeed, all major browsers will request the latter regardless of what is input.) Equally, a Punycode-encoded IRI and the original IRI are equivalent. There is a whole section on equivalence in both RFCs (3987 includes 3986 by reference, so it is a superset).
Browser equivalence is another thing entirely. Most browsers will accept http://www。google。com (because in Japanese '。' is '.'). But if you try to request that actual resource, it doesn't lead anywhere.
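
To make the address-bar behaviour concrete, here is a minimal Python sketch of that kind of dot mapping. The function name is made up for illustration, the separator set is the one listed in RFC 3490 section 3.1, and real browsers do considerably more than this when they clean up typed input.

```python
# Characters that RFC 3490 section 3.1 says must be recognised as label
# separators, in addition to the ASCII full stop.
IDNA_DOTS = "\u3002\uff0e\uff61"


def normalise_host_dots(host: str) -> str:
    """Replace ideographic/fullwidth full stops with an ASCII '.'.

    Hypothetical helper: it only handles the separators, not the rest of
    what IDNA processing involves.
    """
    return host.translate({ord(c): "." for c in IDNA_DOTS})


print(normalise_host_dots("www\u3002google\u3002com"))  # www.google.com
```

This is a convenience applied to user input, not something RFC 3986 defines as URL equivalence.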

But yeah, HN should just use browser equivalence.

But that is totally undefined, and all browsers do their own thing for what's entered in the address bar. There's more consistency in how URLs in content are handled: that path doesn't do stuff like normalising '。', but it does handle the percent-encoded case (for unreserved characters, as the spec says).

Following what the spec says for equivalence makes sense, at least. Anything more drastic is technically treating distinct URLs as equivalent.
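
As a rough sketch of what "following the spec" could look like, the Python below applies part of the syntax-based normalisation from RFC 3986 section 6.2.2: lowercase the scheme and host, decode percent-escapes of unreserved characters, and uppercase the hex digits of the escapes that remain. The function names are made up, dot-segment removal is left out for brevity, and nothing here is claimed to be what HN actually does.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# RFC 3986 section 2.3: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def _fix_escape(match):
    """Decode an escape of an unreserved character, else uppercase its hex."""
    ch = chr(int(match.group(1), 16))
    return ch if ch in UNRESERVED else "%" + match.group(1).upper()


def normalise(url: str) -> str:
    """Hypothetical helper: partial RFC 3986 syntax-based normalisation."""
    parts = urlsplit(url)
    # Only the host is case-insensitive; leave any userinfo untouched.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    return urlunsplit((
        parts.scheme.lower(),
        userinfo + at + hostport.lower(),
        _PCT.sub(_fix_escape, parts.path),
        _PCT.sub(_fix_escape, parts.query),
        _PCT.sub(_fix_escape, parts.fragment),
    ))


print(normalise("HTTP://Example.COM/%61"))  # http://example.com/a
```

Under these rules http://example.com/%60 stays as it is (a backtick is not an unreserved character), and http://example.com/a/ is not folded into http://example.com/a, since the spec does not treat a trailing slash as insignificant.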

Without actually checking redirects, letting a few duplicate submissions through is much better than breaking a few edge cases.
Or it could check for a 3xx HTTP status.
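
A minimal sketch of such a check in Python, using only the standard library. The helper names are made up, the request uses HEAD (which some servers answer differently from GET), and URLs containing userinfo are not handled. `http.client` does not follow redirects on its own, so the 3xx status and its Location header are visible to the caller.

```python
import http.client
from typing import Optional, Tuple
from urllib.parse import urlsplit


def is_redirect(status: int) -> bool:
    """True for any 3xx HTTP status code."""
    return 300 <= status < 400


def head_status(url: str, timeout: float = 5.0) -> Tuple[int, Optional[str]]:
    """Send a HEAD request; return (status, Location header or None).

    Hypothetical helper, not a description of how HN works.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
        conn.request("HEAD", target)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()
```

A de-duplicator could call `head_status` on the submitted URL and, when `is_redirect` is true, compare the Location target against existing submissions instead of guessing from the string alone.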