by ot 4693 days ago

Just out of curiosity, I submitted this yesterday:

https://news.ycombinator.com/item?id=6190603

The URL was

    http://www.python.org/dev/peps/pep-0450/

While this is

    http://www.python.org/dev/peps/pep-0450

That is, exactly the same except for a trailing slash. Doesn't the deduplication algorithm handle this case?

3 comments

daGrevis 4693 days ago

Technically speaking, they are separate URLs that may lead to separate resources. For example, Google's engine treats them as separate URLs.

That's the reason why opening http://www.python.org/dev/peps/pep-0450 redirects to http://www.python.org/dev/peps/pep-0450/ . The HN engine should follow redirects to avoid situations like this.

ot 4693 days ago

Technically speaking, there are no equivalent URLs in general: different strings may lead to different resources.

Still, there are a number of common-sense heuristics for normalizing URLs that HN applies for de-duplication. I was wondering what the rationale is for not including trailing-slash removal among them. I mean, is there any legitimate website that serves a different resource if you remove the trailing slash?
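
A minimal sketch of the kind of heuristic being asked about, not HN's actual algorithm: the function name `dedup_key` and the exact set of rules (lowercase the authority, drop the fragment, strip trailing slashes from the path) are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def dedup_key(url: str) -> str:
    """Hypothetical de-duplication key for a submitted URL.

    This is a heuristic, not RFC 3986 equivalence: a server is free to
    serve different resources with and without the trailing slash.
    """
    parts = urlsplit(url)  # urlsplit already lowercases the scheme
    # Crude: lowercases the whole authority, including any userinfo.
    authority = parts.netloc.lower()
    # Strip trailing slashes, but keep "/" for an otherwise empty path.
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment; keep the query string untouched.
    return urlunsplit((parts.scheme, authority, path, parts.query, ""))
```

With this key, the two submissions from the original post collapse to the same string, `http://www.python.org/dev/peps/pep-0450`.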

gsnedders 4693 days ago

Per RFC 3986/7, http://example.com/%61 and http://example.com/a are equivalent. (Indeed, all major browsers will request the latter regardless of what is input.) Equally, punycode-encoded IRIs and the original IRI are equivalent. There is a whole section on equivalence in both of the RFCs (3987 includes 3986 by reference, so is a superset).
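
A rough sketch of the percent-encoding normalization that RFC 3986 describes: a percent-encoded octet that corresponds to an unreserved character (letters, digits, `-`, `.`, `_`, `~`) can be decoded, for example `%61` to `a`, while every other escape is kept and only has its hex digits uppercased. The function name `decode_unreserved` is made up for this example, and it covers only this one normalization step.

```python
import re
import string

# Unreserved characters per RFC 3986: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def decode_unreserved(url: str) -> str:
    """Decode %XX escapes of unreserved characters; uppercase the rest."""

    def repl(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        # Reserved or other octets stay encoded, with uppercase hex digits.
        return "%" + match.group(1).upper()

    return re.sub(r"%([0-9A-Fa-f]{2})", repl, url)
```

Under this sketch, `http://example.com/%61` becomes `http://example.com/a`, while `http://example.com/%60` (a backtick, which is not unreserved) is left encoded.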

U2EF1 4693 days ago

Browser equivalence is another thing entirely. Most browsers will accept http://www。google。com (because in Japanese '。' is '.'). But if you tried to request that actual resource, it wouldn't lead anywhere.

But yeah, HN should just use browser equivalence.

gsnedders 4692 days ago

But that is totally undefined, and all browsers do their own thing for what's entered in the address bar. There's more consistency for URLs in content, where browsers don't do stuff like normalising '。' but do handle the percent-encoded case (for unreserved characters, as the spec says).

Following what the spec says for equivalence makes sense, at least. Anything more drastic is technically treating distinct URLs as equivalent.

keeperofdakeys 4693 days ago

Without actually checking redirects, letting a few submissions be duplicated is much better than breaking a few edge cases.

zeckalpha 4693 days ago

Or it could check for a 3xx HTTP status.
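
One way such a check could look, as a sketch only: send a `HEAD` request without following redirects and, if the status is 3xx, treat the `Location` target as the URL to de-duplicate on. The helper names `redirect_target` and `head_redirect` are hypothetical, and nothing here reflects how HN is actually implemented.

```python
import http.client
from typing import Optional
from urllib.parse import urljoin, urlsplit


def redirect_target(url: str, status: int, location: Optional[str]) -> Optional[str]:
    """Return the absolute redirect target for a 3xx response, else None."""
    if 300 <= status < 400 and location:
        # Location may be relative, so resolve it against the request URL.
        return urljoin(url, location)
    return None


def head_redirect(url: str, timeout: float = 5.0) -> Optional[str]:
    """Issue one HEAD request; http.client never follows redirects itself."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return redirect_target(url, response.status, response.getheader("Location"))
    finally:
        conn.close()
```

A submission handler could call `head_redirect(url)` and, when it returns a URL, compare that against existing submissions instead of the string the user typed. This costs a network round trip per submission, which is the trade-off raised elsewhere in the thread.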

alexholehouse 4693 days ago

I posted an ASK PG [1] about this last month and got an angry email from (presumably) a mod. His point was totally valid, that PG "ain't got time fo' that", which is true, but I was just hoping to draw attention to it, as opposed to demanding that PG drop what he's doing right now and fix it. To be honest, I was quite surprised by the tone of the email.

[1] https://news.ycombinator.com/item?id=5908075

pg 4693 days ago

The email wasn't from one of us.

coldtea 4693 days ago

Why should it?

For one, it provides the welcome ability to bring topics up on Hacker News again, where they might be received better the second or third time (e.g. because more people are online at the time of the second submission).

If the "deduplication algorithm" had "handled this case", then we would only be left with the first submission (a dead discussion), whereas as it is, HN users have now caught on to this PEP news and we have a discussion going on.
