by ot 4693 days ago
Just out of curiosity, I submitted this yesterday:

https://news.ycombinator.com/item?id=6190603

The URL was

    http://www.python.org/dev/peps/pep-0450/
While this is

    http://www.python.org/dev/peps/pep-0450
That is, exactly the same except for a trailing slash. Doesn't the deduplication algorithm handle this case?
3 comments

Technically speaking, they are separate URLs that may lead to separate resources. For example, Google's search engine treats them as separate URLs.

That's the reason why opening http://www.python.org/dev/peps/pep-0450 redirects to http://www.python.org/dev/peps/pep-0450/ . The HN engine should follow redirects to avoid situations like this.

Technically speaking, there are no equivalent URLs in general; different strings may lead to different resources.

Still, there are a number of common-sense heuristics for normalizing URLs that HN applies for de-duplication. I was wondering what the rationale is for not having trailing-slash removal among them. I mean, is there any legitimate website that serves a different resource if you remove the trailing slash?
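To make the heuristic in question concrete, here is a minimal Python sketch of a de-duplication key that optionally strips a trailing slash. It is not HN's actual algorithm; the function name `dedup_key` and the `strip_trailing_slash` flag are hypothetical, and dropping the fragment is an extra assumption made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def dedup_key(url, strip_trailing_slash=True):
    """Return a comparison key for a submitted URL (illustrative heuristic only)."""
    parts = urlsplit(url)
    path = parts.path or "/"  # an empty path and "/" name the same http resource
    if strip_trailing_slash and len(path) > 1 and path.endswith("/"):
        # Heuristic, not spec: assumes /foo and /foo/ serve the same page.
        path = path.rstrip("/") or "/"
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))


a = dedup_key("http://www.python.org/dev/peps/pep-0450/")
b = dedup_key("http://www.python.org/dev/peps/pep-0450")
print(a == b)
```

With `strip_trailing_slash=False` the two PEP URLs keep distinct keys, which matches the stricter view that they are separate URLs.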

Per RFC 3986/3987, http://example.com/%61 and http://example.com/a are equivalent. (Indeed, all major browsers will request the latter regardless of what is input.) Equally, a Punycode-encoded IRI and the original IRI are equivalent. There is a whole section on equivalence in both RFCs (3987 includes 3986 by reference, so it is a superset).
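As a rough illustration of the spec-level equivalence, here is a minimal Python sketch of the syntax-based normalization from RFC 3986 section 6.2.2: lowercase the scheme and host, decode percent-escapes of unreserved characters, and uppercase the hex digits of the escapes that remain. The function names are hypothetical, and the sketch leaves out dot-segment removal and IDNA/Punycode handling.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Unreserved characters per RFC 3986 section 2.3.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def _normalize_escapes(text):
    """Decode escapes of unreserved characters; uppercase all other escapes."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        return "%" + match.group(1).upper()
    return _PCT.sub(fix, text)


def _lower_host(netloc):
    """Lowercase host[:port] but leave any case-sensitive userinfo alone."""
    userinfo, at, hostport = netloc.rpartition("@")
    return userinfo + at + hostport.lower()


def normalize_url(url):
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        _lower_host(parts.netloc),
        _normalize_escapes(parts.path),
        _normalize_escapes(parts.query),
        _normalize_escapes(parts.fragment),
    ))


print(normalize_url("http://example.com/%61"))
```

Note that this normalization alone leaves the two PEP URLs distinct, since a trailing slash is not covered by these rules.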
Browser equivalence is another thing entirely. Most browsers will accept http://www。google。com (because in Japanese '。' is '.'). But if you tried to request that literal URL, it wouldn't lead anywhere.

But yeah HN should just use browser equivalence.

But that's totally undefined, and all browsers do their own thing for what's entered in the address bar. There's more consistency in how URLs in content are handled, and that doesn't do stuff like normalising '。' but does handle the percent-encoded case (for unreserved characters, as the spec says).

Following what the spec says for equivalence makes sense, at least. Anything more drastic is technically treating distinct URLs as equivalent.

Without actually checking redirects, not breaking a few edge cases is much better than a few submissions being duplicated.
Or it could check for a 3xx HTTP status.
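A minimal Python sketch of that redirect check, assuming the site answers HEAD requests and sends a `Location` header on 3xx responses. The names `head_once` and `resolve_redirects` are hypothetical, and the fetch function is passed in as a parameter so the resolution logic can be exercised without network access.

```python
import http.client
from urllib.parse import urljoin, urlsplit


def head_once(url, timeout=5.0):
    """Send one HEAD request; return (status code, Location header or None)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
        conn.request("HEAD", target)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def resolve_redirects(url, fetch=head_once, max_hops=5):
    """Follow 3xx responses up to max_hops and return the last URL reached."""
    seen = {url}
    for _ in range(max_hops):
        status, location = fetch(url)
        if not (300 <= status < 400) or not location:
            return url
        next_url = urljoin(url, location)  # Location may be relative
        if next_url in seen:
            return url  # redirect loop; stop where we are
        seen.add(next_url)
        url = next_url
    return url
```

Two submissions whose resolved URLs match could then be treated as duplicates, at the cost of one or more extra requests per submission.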
I posted an ASK PG [1] about this last month and got an angry email from (presumably) a mod. His point was totally valid - that PG 'ain't got time fo' that', which is true - but I was just hoping to draw attention to it, as opposed to demanding that PG drop what he's doing right now and fix it. To be honest, I was quite surprised by the tone of the email.

[1] https://news.ycombinator.com/item?id=5908075

The email wasn't from one of us.
Why should it?

For one, it provides the welcome ability to bring topics up on Hacker News again, where they might be received better the second or third time (e.g. because more people are online at the time of the second submission).

If the "deduplication algorithm" had "handled this case", then we would only be left with the first submission (a dead discussion), whereas as it is, HN users have now caught on to this PEP news and we have a discussion going on.