Hacker News
by yathern 3172 days ago
Great post - I quite like the stackoverflow.com style of `stackoverflow.com/questions/<question-id>/<question-title>`, where <question-title> can be changed to anything, and the link still works.

This allows for easy URL readability, while also having a unique ID.

In the context of this post (the library example) that would look like

library.com/books/1as03jf08e/Moby-Dick/
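A minimal sketch of how a server could treat such a URL, assuming the ID is always the path segment right after `/books/`. `parseBookPath` is a hypothetical helper for illustration, not something library.com or Stack Overflow is known to use.

```javascript
// Hypothetical helper: pull the ID out of /books/<id>/<optional-slug>/.
// The slug is captured for display only; it plays no part in the lookup.
function parseBookPath(pathname) {
  const match = /^\/books\/([A-Za-z0-9]+)(?:\/([^/]*))?\/?$/.exec(pathname);
  if (!match) return null;
  return { id: match[1], slug: match[2] || null };
}

console.log(parseBookPath("/books/1as03jf08e/Moby-Dick/"));
console.log(parseBookPath("/books/1as03jf08e/anything-else"));
console.log(parseBookPath("/books/1as03jf08e"));
```

All three calls yield the same `id`, which is why the title part can be changed or dropped without breaking the link.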

8 comments

Doing this means that:

1) there are now an infinite number of URLs for every one of your pages that may end up separately stored on various services (mitigated for only some kinds of service if you redirect to correct),

2) if the title changes the URLs distributed are now permanently wrong as they stored part of the content (and if you redirect to correct, can lead to temporary loops due to caches),

3) the URL is now extremely long and since most users don't know if a given website does this weird "part of the URL is meaningless" thing there are tons of ways of manually sharing the URL that are now extremely laborious,

4) you have now made content that users think should somehow be "readable" but which doesn't even try to be canonical... so users who share the links will think "the person can read the URL, so I won't include more context" and the person receiving the links thinks "and the URL has the title, which I can trust more than what some random user adds".

The only website I have ever seen which I feel truly understands that people misuse and abuse title slugs and actively forces people to not use them is Hacker News (which truncates all URLs in a way I find glorious), which is why I am going to link to this question on Stack Exchange that will hopefully give you some better context "manually".

meta.stackexchange.com/questions/148454/why-do-stack-overflow-links-sometimes-not-work/

Many web browsers don't even show the URL anymore: the pretense that the URL should somehow be readable is increasingly difficult to defend. A URL should sometimes still be short and easy to type, but these title slug URLs don't have that property in spades.

If anything, other critical properties of a URL are that it is permanent and canonical, and neither of these properties tends to be satisfied well by websites that go with title slugs. Including the ID in there mitigates the problem, but it leaves the URL in some confusing middle-land where part of it has these properties and part of it doesn't.

If you are going to insist upon doing this, how about doing it using a # on the page, so at least everyone had a chance to know that it is extra, random data that can be dropped from the URL without penalty and might not come from the website and so shouldn't be trusted?

(edit to add:) BTW, if you didn't know you could do this, Twitter is the most epic source of "part of the URL has no meaning" that I have ever run across, as almost no one realizes it due to where it is placed in the URL.

twitter.com/realDonaldTrump/status/247076674074718208

> 1) there are now an infinite number of URLs for every one of your pages that may end up separately stored on various services (mitigated for only some kinds of service if you redirect to correct)

No need to redirect, that's what canonical links are for:

https://developer.mozilla.org/en-US/docs/Web/HTML/Link_types

I don't disagree in that I mostly dislike URL slugs, too. Except for some hub pages ("photos", "blog", etc.), a numerical ID is more than enough. But the combination of ordering and display modes and filtering can still amount to a huge number of combinations, so canonical links are still needed - to have as many options for the user as possible and allow them all to be bookmarked, but also give search engines a hint on what minor permutations they can ignore safely.
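As a rough sketch of that idea: many sort/view permutations can all point at one canonical URL. The parameter names in `PRESENTATION_PARAMS` and the example.com URLs are assumptions made up for illustration; which parameters are "minor permutations" depends entirely on the site.

```javascript
// Assumption: these query parameters change only presentation, not content.
const PRESENTATION_PARAMS = ["sort", "view", "page_size"];

// Drop presentation-only parameters and the fragment to get one canonical URL.
function canonicalHref(rawUrl) {
  const url = new URL(rawUrl);
  for (const name of PRESENTATION_PARAMS) url.searchParams.delete(name);
  url.hash = "";
  return url.toString();
}

// Build the tag for the page <head>, escaping & and " for the HTML attribute.
function canonicalLinkTag(rawUrl) {
  const href = canonicalHref(rawUrl).replace(/&/g, "&amp;").replace(/"/g, "&quot;");
  return `<link rel="canonical" href="${href}">`;
}

console.log(canonicalLinkTag("https://example.com/photos?sort=date&tag=cats&view=grid#top"));
```

Every variant stays bookmarkable for users, while crawlers that honor the hint see a single page.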

I wish search engines would completely ignore words in the URL. If it's not in the page (or the "metadata" of actual content on pages linking to it, and so on), screw the URL. If it is in the page (and the URL), you don't need the URL. As long as they are incentivized, we'll have fugly URL schemes.

1) and 2) are not a problem if the server accepts any value for the title token (which is the case on stack exchange)

3) is not a problem for hyperlinks (url not visible) or for even direct links (not burdensome length), and if you care about a short url an even shorter form is available

4) seems like a feature? the person sending the link will only ever include as much information as they deem necessary anyway. If the recipient wants more info they'll either request it or click the link.

Trust is an interesting point, but you can equally put literally anything in the client-side anchor (e.g. meta.stackexchange.com/questions/148454/#definitely-not-a-rick-roll), so I don't see what a viable alternative would be.
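To illustrate the anchor point: the fragment is never part of what the browser requests from the server, so anything can be appended there. A small sketch using the standard `URL` class; `requestTarget` is a hypothetical helper name.

```javascript
// Returns the part of the URL that is actually sent in the HTTP request line.
// The fragment (everything from '#') stays on the client.
function requestTarget(rawUrl) {
  const url = new URL(rawUrl);
  return url.pathname + url.search;
}

const link = "https://meta.stackexchange.com/questions/148454/#definitely-not-a-rick-roll";
console.log(requestTarget(link));  // the path the server sees
console.log(new URL(link).hash);   // the client-only part
```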

The usual way I've seen to deal with this kind of ambiguity is by doing a 301 redirect so that bookmarks get changed and the url in the address bar is also changed. It doesn't fix external parties linking to the site with the now deprecated url but there was never anything you could reasonably do about that.
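A sketch of that redirect decision, with the lookup reduced to an in-memory map. The `books` store, `slugify`, and `resolve` are all hypothetical names for illustration; a real site would have its own storage and slug rules.

```javascript
// Hypothetical in-memory store; a real site would query a database.
const books = new Map([["1as03jf08e", { title: "Moby Dick" }]]);

// Assumed slug rule: runs of non-alphanumerics become a single hyphen.
function slugify(title) {
  return title.trim().replace(/[^A-Za-z0-9]+/g, "-").replace(/^-|-$/g, "");
}

// Look up by ID only; if the requested slug is not the current one, answer 301.
function resolve(id, requestedSlug) {
  const book = books.get(id);
  if (!book) return { status: 404 };
  const canonicalSlug = slugify(book.title);
  if (requestedSlug !== canonicalSlug) {
    return { status: 301, location: `/books/${id}/${canonicalSlug}/` };
  }
  return { status: 200, title: book.title };
}

console.log(resolve("1as03jf08e", "Hitchhikers-Guide-to-the-Galaxy"));
console.log(resolve("1as03jf08e", "Moby-Dick"));
```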

> If you are going to insist upon doing this, how about doing it using a # on the page, so at least everyone had a chance to know that it is extra, random data that can be dropped from the URL without penalty and might not come from the website and so shouldn't be trusted?

The fragment doesn't get indexed by search engines, so not many will see it. Along with that, in my understanding, having something human-readable in the URL helps with SEO in at least Google and Bing, so doing this could hurt your search rankings, which isn't a good thing.

Minor correction, because dealing with this is a part of my job: Almost no browsers have implemented changing bookmarks in response to 301 redirects. Link has further context and some testing.

https://superuser.com/questions/151366/do-browsers-change-ur...

Interesting, I had always gone with that since the RFC says it should happen. Good to know.
301 may be dangerous, because browsers cache them.

Suppose the client follows a link to old-slug after the slug has been changed to new-slug. The server responds 301 → new-slug. The client caches that redirect, so that if you request old-slug it will immediately take you to new-slug without querying the server.

Then the object’s slug is changed back to old-slug (perhaps the change was made in error). Now a request to new-slug produces a 301 → old-slug. This likewise is cached, and a client may now be stuck in an infinite redirect loop.

I’m not sure if this is actually what browsers do; they might detect the loop and decide to throw away their cached redirects. I haven’t tested it; but I wouldn’t count on it.
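The scenario can be modelled in a few lines. This is a toy client that caches every 301 forever, which is an assumption for the sake of the example and not a claim about what any real browser does.

```javascript
// Toy model: a client that remembers every 301 permanently and never revalidates.
function makeClient(server) {
  const cache = new Map(); // from-path -> to-path
  return function fetchPath(path, maxHops = 10) {
    const visited = [path];
    for (let hop = 0; hop < maxHops; hop++) {
      let next = cache.get(path);
      if (next === undefined) {
        const res = server(path);
        if (res.status !== 301) return { ok: true, path, visited };
        cache.set(path, res.location);
        next = res.location;
      }
      path = next;
      visited.push(path);
    }
    return { ok: false, visited }; // gave up: the cached redirects form a loop
  };
}

// Toy server: serves the current slug, 301s everything else to it.
let currentSlug = "old-slug";
const server = (path) =>
  path === "/" + currentSlug ? { status: 200 } : { status: 301, location: "/" + currentSlug };

const fetchPath = makeClient(server);
currentSlug = "new-slug";
const first = fetchPath("/old-slug");  // 301 -> /new-slug gets cached
currentSlug = "old-slug";              // the rename is reverted
const second = fetchPath("/new-slug"); // 301 -> /old-slug gets cached, then the two bounce
console.log(first.ok, second.ok);
```

In this model the second request never reaches a 200, because after the first hop the client only consults its own cache.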

I just tried it locally in Chrome/Firefox/Safari. Ends up working, no issue.
This used to be a serious problem. It may be fixed now, but Firefox would eternally cache 301s unless explicitly told not to. This is why I configure all of my servers to disallow caching of 301s.
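One way to express "disallow caching of 301s" is to send `Cache-Control: no-store` along with the redirect. The sketch below assumes a Node-style response object with `writeHead` and `end`; `permanentRedirect` is a hypothetical helper, and other servers would set the same header through their own configuration.

```javascript
// Send a 301 that clients and intermediaries are told not to store.
function permanentRedirect(res, location) {
  res.writeHead(301, { Location: location, "Cache-Control": "no-store" });
  res.end();
}

// Usage inside a Node http handler (not run here):
//   require("http").createServer((req, res) => permanentRedirect(res, "/new-slug"));
```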
Minor nitpick, I'm not sure if exact match in URL slugs matters from Google's perspective very much. I do read that searchers' eyes can be drawn towards the exact match (which are frequently bolded in the SERPs), possibly leading to a higher clickthrough rate.
It's been a while since I was looking at how google's crawler worked. For items that had multiple ways of navigating there, I remember using the link rel="canonical" to let google know where the page would have been if not for the category information etc in the url.
1: so what? I use this for my blog (cryptologie.net) and this has never been a problem. Search engines handle that quite well.

2: no. The URL is not wrong. Rather it won’t describe the content perfectly anymore. If this is an issue you can attribute a new ID to your page.

3: that’s why you have url shorteners. But what’s wrong with a long url? And how does it complicate sharing it? To share you copy/paste the url. Nothing changed. And now the url describes the content! (That’s the reason we do it.)

4: that’s a good thing!

So yeah. I’ll keep doing this for my blog and I hope websites like SO keep doing that as well

>>> the pretense that the URL should somehow be readable is increasingly difficult to defend

I think I have a defense for this. I consistently long press links on mobile to see the url before deciding whether to load the page or not. Just to see if I can be bothered.

> 3) the URL is now extremely long and since most users don't know if a given website does this weird "part of the URL is meaningless" thing there are tons of ways of manually sharing the URL that are now extremely laborious,

I'm missing something -- what does length have to do with the difficulty of sharing a URL? I can't remember the last time I typed out any URL past the TLD.

> 1) there are now an infinite number of URLs for every one of your pages that may end up separately stored on various services

Sometimes I call this a URL black hole.

In all fairness, black holes are everywhere when you consider that most web servers ignore unrecognized query params for routing. Examine this URL:

https://news.ycombinator.com/item?t=choosing-between-names-a...

Of course the difference is that Hacker News doesn't disseminate URLs of that form, but that doesn't mean someone couldn't pollute the internet with them.

> there are now an infinite number of URLs for every one of your pages that may end up separately stored on various services

What services? Web crawlers? I'm sure the ones I would care about are smart enough to know how this works. There are many ways infinite valid URLs can be made. Query params, subdomains and hashroutes to name a few.

> if the title changes the URLs distributed are now permanently wrong as they stored part of the content (and if you redirect to correct, can lead to temporary loops due to caches),

You don't redirect. The server doesn't even look at the slug part of the URL for routing purposes. You can change the url with javascript post-load if it bothers you (as stackoverflow does). Cache loops are an entirely avoidable problem here.
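A sketch of the post-load rewrite, using the browser's `history.replaceState`. The function takes the location and history objects as parameters so it can be exercised outside a browser; `fixSlugInAddressBar` and the example path are hypothetical, and this is not a claim about how any particular site implements it.

```javascript
// After the page has loaded, swap whatever slug was requested for the
// canonical one without a navigation or a redirect.
function fixSlugInAddressBar(canonicalPath, loc, hist) {
  if (loc.pathname === canonicalPath) return false; // already canonical
  hist.replaceState(null, "", canonicalPath + loc.search + loc.hash);
  return true;
}

// In a browser:
//   fixSlugInAddressBar("/books/1as03jf08e/Moby-Dick/", window.location, window.history);
```

Since no redirect is ever issued, there is nothing for a cache to loop on.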

> the URL is now extremely long and since most users don't know if a given website does this weird "part of the URL is meaningless" thing there are tons of ways of manually sharing the URL that are now extremely laborious

Extremely long and extremely laborious seems a bit of an exaggeration. Most users copy and paste, no? Adding a few characters of a human readable tag doesn't warrant this response I feel. Especially when the benefit means that if I copy and paste a url into someplace, I can quickly error-check it to make sure it's the title I mean. When using the share button, the de-slugged URL can be given.

> users who share the links will think "the person can read the URL, so I won't include more context" and the person receiving the links thinks "and the URL has the title, which I can trust more than what some random user adds".

I guess? I won't bother with a rebuttal because this issue seems so minor. The benefit far outweighs some users maybe providing less context because the link url made them do it. If someone says "My typescript won't compile because of my constructor overloading or something please help", I can send stuff like:

stackoverflow.com/questions/35998629/typescript-constructor-overload-with-empty-constructor

stackoverflow.com/questions/26155054/how-can-i-do-constructor-overloading-in-a-derived-class-in-typescript

which I think is so much more useful than just IDs.

> Many web browsers don't even show the URL anymore: the pretense that the URL should somehow be readable is increasingly difficult to defend

Most do. Even still, the address bar is not the only place a URL is seen. Links in text all over the internet have URLs - particularly when shared in unformatted text (i.e. not anchor tags). And URLs should be readable to some extent. Would you suggest that all pages might as well be unique IDs? A URL like:

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...

Is much better than

https://developer.mozilla.org?articleId=10957348203758

> how about doing it using a # on the page, so at least everyone had a chance to know that it is extra

Fair enough - I think that's a fine idea.

I don't have a direct piece of evidence, but most users don't even know about ctrl-f, so I think they don't copy and paste. They click (or tap, these days) on links. https://www.theatlantic.com/technology/archive/2011/08/crazy...

Most users click links.

I meant in the context of sharing links, either on a board like this or in a text. But that does bring up a good point: how many users know how to copy/paste?

Among all internet users, I would conservatively assume 30%+ do. Among people who have posted a link to social media or forums, I would assume 80%+. But I'd be interested to see how off I am.

There's a reason there are those share buttons on every website that's chasing viral traffic.

I suspect most people who share on Facebook share via those, or via the Facebook app's own internal web viewer. I would assume Twitter is a bit more savvy, but I still would not bet strongly that a majority of people on Twitter know about copy-paste.

One thing I find useful about slugs in URLs is it lets me see that I used the intended link when I paste it
> there are now an infinite number of URLs for every one of your pages that may end up separately stored on various services (mitigated for only some kinds of service if you redirect to correct),

Not all services allow you to change the title (and therefore mutate the slug), but situations where changing the title changes the slug are so infrequent (and in this case, the consequences so minor) that this is a problem mostly in theory. It's a minuscule price to pay for semantically useful URLs.

Amazon has been doing this URL scheme since many years ago, e.g.: https://www.amazon.com/Optional-product-name/dp/A00BCDEF00ID...
Strangely enough, discourse uses the following style:

https://meta.discourse.org/t/deleted-topics-where-are-they/2...

/t/ for topic, slug for readability, then a topic id and at last a reply id.

and your comment is the perfect demonstration why: when truncated, the id gets cut off before the slug.
Which is... something you don't want to happen, right?
To be fair, if you truncate a url for anything other than display, it's toast (which is why no sane person truncates URLs except for display).
Right, I was thinking purely of display truncation, like here. For surviving copy-paste from here, or any other actual truncation, it's bad, true.
So does reddit. Go to any comment section. You can remove the latter part with the title and only leave the identifier, and the link will still work. The short link actually only contains the identifier.
This seems like it's vulnerable to some form of abuse.

library.com/books/1as03jf08e/Moby-Dick/

library.com/books/1as03jf08e/Hitchhikers-Guide-to-the-Galaxy

Now lead to the same place...

eh. You can do that with query strings and hashes in URLs anyway. https://news.ycombinator.com/user?id=digikata&profile=bad-pe...
standards wise, you know the part after ? is variable though...
variable? Not sure if I 100% get what you're saying, but what I know is that https://news.ycombinator.com/user?id=digikataWaitNoThisOther... won't go to the same place as your user profile. There's standards, and then there's "Standards".
You would redirect to the canonical one.
I think the concern is in the way it obscures the target. Replace "Moby Dick" with a Chuck Tingle (warning, probably nsfw) book. Now that second link is a serious problem.
I see what you're saying, but it doesn't seem like much more than a funny gag you might pull on a friend.

If a website is concerned about that case, then instead of letting it inform their URL design, they should have a "Warning: Adult content. [Continue] [Back]" interstitial like Reddit or Steam.

I'm not even sure it's a serious problem - a possible annoyance, and perhaps, for a spammy site owner, maybe even a feature. But as a web user, I'm not really fond of that added uncertainty.
You don't necessarily have to redirect, but you should at least include `<link rel="canonical" href="..." />` (as the given example, Stack Overflow, does) so that search robots and other website clients (scrapers and/or API consumers) know which one is the canonical path, to avoid duplicated effort.
That only works for some crawlers. Certainly not for users. Meanwhile, everything obeys redirects.

Since you bring up Stack Overflow, notice that they do the canonical redirect. Change the title in the URL and you'll get redirected.

Yes, the best approach is probably both, but knowing the canonical path matters more for crawlers than for users, and a crawler ignoring rel="canonical" is likely not much better than (or about as buggy as) a crawler ignoring robots.txt; it's a specification they ignore at their own peril.
A bunch of news sites use similar URL parsing; they tend to not care about the "slug" either. I think this is, in the general case, the best way.
As long as you provide a canonical URL.
Or if you redirect any non-canonical URLs to the canonical one.
Goodreads does something similar, which I also appreciate. An example: https://www.goodreads.com/book/show/22733729-the-long-way-to...

You can take off any of the words past the numeric ID and it still works just fine.

Ha, I didn't realize that you could change the question title or even leave it out altogether without breaking the link. Neat!
This is the way I've always done it as well, and super easy to implement.

For example.

router.get('/article/:article_shortid*?',function(req,res){ });

catches /article/28424824/this-is-my-article, and also /article/28424824