Hacker News
by heinrichhartman 2161 days ago
I think this is just unrealistic. Let's look at this example:

    http://www.pathfinder.com/money/moneydaily/1998/981212.moneyonline.html

This consists of:

0. Access protocol

1. Hostname/DNS name

2. Arbitrarily chosen path hierarchy

3. File extension

This is really a description of where to find a document ("locator" not "identifier"). So, if you:

- reorganize or clean up your file structure

- change or hide the file extension

- enable HTTPS

- migrate files to a different domain name

This WILL change the URL. What are you going to do? Not clean up your space anymore? Stick to HTTP? So URLs DO change. That's just the reality.

If you want something that does not change, don't link to a location but link to content directly: E.g.

- git hashes do not change

- torrent/magnet Links don't change

- IPFS links do not change.

Or use a central authority, that stewards the identifier:

- DOI numbers don't change

- ISBN numbers don't change
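To make the content-addressing idea concrete (git hashes in particular), here is a minimal sketch of how Git derives a blob ID from nothing but the bytes of the content. It assumes a repository using the default SHA-1 object format; `git_blob_id` is a name made up for this illustration.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Return the object ID Git assigns to a blob with this content.

    Assumes the SHA-1 object format: the hash covers a small header
    ("blob <size>\\0") followed by the raw bytes.
    """
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


# The ID depends only on the bytes, not on any hostname, path or extension.
print(git_blob_id(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a

# Any edit to the content produces a different ID.
print(git_blob_id(b"hello\n") == git_blob_id(b"hello, world\n"))  # False
```

The flip side is visible in the last line: such a link never changes because it can never point to an updated version of the document.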

7 comments

> What are you going to do?

The article addresses this by reminding you that though URIs often look like paths, they can be arbitrarily mapped.

By all means move the resource, but put a redirect under the old URI. This means old links continue to work, which is the key point of the article.
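A rough sketch of what "put a redirect under the old URI" can look like once a resource has moved more than once: keep a table of moves and resolve an old path to its current home in a single step. The new paths below and the `resolve` helper are hypothetical; only the first path comes from the example URL.

```python
def resolve(path: str, moved: dict[str, str]) -> str:
    """Follow recorded moves until reaching a path that has not moved."""
    seen = {path}
    while path in moved:
        path = moved[path]
        if path in seen:
            # A cycle in the table would otherwise loop forever.
            raise ValueError(f"redirect loop at {path}")
        seen.add(path)
    return path


# Hypothetical history: the article was moved twice.
moved = {
    "/money/moneydaily/1998/981212.moneyonline.html": "/archive/1998/981212",
    "/archive/1998/981212": "/articles/981212",
}

old = "/money/moneydaily/1998/981212.moneyonline.html"
print(resolve(old, moved))  # /articles/981212
```

Resolving the whole chain up front means an old link costs the visitor one redirect, however many reorganizations happened in between.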

Yes. Have you tried to do that even for moderately complex sites?

I have tried to do it a few times, and eventually just gave up. Carrying forward bad naming decisions from the past is a tremendous effort. When cleaning up the house, I also don't leave around sticky notes at the places where I removed documents from.

On top of this:

- When using static site generators, it's not even possible to do 301 redirects (you would have to use an ugly, slow JS version).

- It does not help if you don't own the old DNS name anymore.

> When using static site generators, it's not even possible to do 301 redirects (you would have to use an ugly, slow JS version).

That isn't always true, depending on your choice of web server. You can use mod_rewrite rules in Apache's .htaccess files, so if your generator is aware of previous URLs for given content, it could generate these rules to 30x-redirect visitors and search bots to the new location.

Off the top of my head I'm not aware of a tool that does this, but it is certainly possible. It would need to track the old content/layout so you'd need the content in a DB with history tracking (or a source control system) or the tool could leave information about previous generations on each run for later reading. Or it could simply rely on the user defining redirect links for it to include in the generated site.

Of course, if you are using a static site generator for maximum efficiency, you probably aren't using Apache with .htaccess processing enabled! I suppose a generator could similarly generate a config fragment for nginx or another server, though that would not be useful if you are publishing via a web server where you lack the privileged access needed to change its configuration.
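As a sketch of the "user defines the redirects, the generator emits server config" variant: the functions below turn a mapping of old to new paths into Apache `RedirectMatch` lines (mod_alias, which works in .htaccess where overrides allow it) or nginx `location` blocks. The function names and sample paths are invented for illustration, and the sketch assumes paths without spaces or other characters that would need quoting in the config file.

```python
import re


def apache_rule(old: str, new: str) -> str:
    """One exact-match permanent redirect for Apache's mod_alias."""
    return f"RedirectMatch 301 ^{re.escape(old)}$ {new}"


def nginx_rule(old: str, new: str) -> str:
    """One exact-match permanent redirect for an nginx server block."""
    return f"location = {old} {{ return 301 {new}; }}"


def render(redirects: dict[str, str], rule) -> str:
    """Render all redirects, sorted so the output is stable between builds."""
    return "\n".join(rule(old, new) for old, new in sorted(redirects.items())) + "\n"


# Hypothetical redirect list maintained alongside the site's content.
redirects = {
    "/blog/old.html": "/posts/old",
    "/about.html": "/about",
}

print(render(redirects, apache_rule))
print(render(redirects, nginx_rule))
```

The generator would write the first output to `.htaccess` in the published directory, or hand the second to whoever controls the nginx config.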

I have done this for a moderately complex site; it was a bit of work, but not the end of the world. I'm sure some things went missing, but we got 99% of it, which I consider successful enough.

You can do 301s statically by generating whatever your particular server's equivalent of an .htaccess file is. Or you can generate HTML files with the meta-redirect header in place.
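For the meta-redirect route, a static generator only needs to write a small stub file at each old path. A minimal sketch, with `redirect_stub` as a made-up helper name; note that this is a client-side redirect, not an HTTP 301, so the canonical link is included as a hint for crawlers.

```python
from html import escape


def redirect_stub(new_url: str) -> str:
    """Return a tiny HTML page that sends the visitor on to new_url."""
    target = escape(new_url, quote=True)  # safe inside attribute values
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en">\n'
        "<head>\n"
        '<meta charset="utf-8">\n'
        f'<meta http-equiv="refresh" content="0; url={target}">\n'
        f'<link rel="canonical" href="{target}">\n'
        "<title>Moved</title>\n"
        "</head>\n"
        f'<body><p>This page has moved to <a href="{target}">{target}</a>.</p></body>\n'
        "</html>\n"
    )


# A generator would write this to the old path, e.g. blog/old.html.
print(redirect_stub("/posts/old"))
```

The plain link in the body keeps the page usable for clients that ignore the refresh.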

The DNS is obviously an issue, but that's not really relevant. The article is advocating for URLs not changing. It's not saying that they mustn't change, just that it's really cool for everyone if they don't.

>When using static site generators, it's not even possible to do 301 redirects (you would have to use an ugly, slow JS version).

I know it's 2020 and all that, but sometimes you don't need 20 MB of minified JS to achieve something: https://en.wikipedia.org/wiki/Meta_refresh

Using an SSG does not mean you can't have an intelligent server that does redirects. That's a limitation of certain web hosts (GitHub Pages, for example).

Netlify allows dead simple redirects, and so do most other static hosting platforms.

Even GitHub Pages behind Cloudflare is capable of issuing a 301.

A classic way to do these redirects is on the front web server itself: .htaccess, nginx config, etc.

When you change the structure of your URLs, you can generally generate redirect rules to translate old URLs to the new structure. Or run a script to individually map each old URL to its new one.
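A small sketch of the rule-based approach: an ordered list of regex rules that translate paths from an old structure to a new one. The new structure (`/money/<year>/<id>`) and the `translate` helper are assumptions made up for this example; only the old path shape is taken from the thread.

```python
import re
from typing import Optional

# Ordered rules: the first pattern that matches wins.
RULES = [
    # Old daily-news layout -> hypothetical shorter layout.
    (re.compile(r"^/money/moneydaily/(\d{4})/(\d{6})\.moneyonline\.html$"), r"/money/\1/\2"),
    # Generic fallback: drop a trailing .html extension.
    (re.compile(r"^/(.+)\.html$"), r"/\1"),
]


def translate(old_path: str) -> Optional[str]:
    """Return the new path for old_path, or None if no rule applies."""
    for pattern, template in RULES:
        match = pattern.match(old_path)
        if match:
            return match.expand(template)
    return None


print(translate("/money/moneydaily/1998/981212.moneyonline.html"))  # /money/1998/981212
print(translate("/about.html"))  # /about
```

Paths that return `None` are the ones that would need an individual mapping, or a decision to let them 404.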

Note: I've never done the latter for more than a few hundred URLs, so I don't know if it scales well for a very large site.

> When cleaning up the house, I also don't leave around sticky notes at the places where I removed documents from.

This is a poor analogy. Perhaps “I’m a librarian for a library with thousands or millions of users, and when I rearrange the books, I don’t leave sticky notes pointing to the new locations”

I don't know about this specific website (or if it even exists), but the 981212 part of the "link" looks like the identifier to me. The way many sites are set up, most of the link is "locating", but it also contains a unique "identifying" component (page/post/item id). You can remove almost all of the locating parts and the identifier still works so the link can be resistant to everything from just a title change to a complete restructuring (as long as the IDs are kept).
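A sketch of how a site could treat only the ID as significant, so that the rest of the path can change freely. The assumption that IDs are exactly six digits is a guess based on the `981212` in the example URL, and `extract_id` is a hypothetical helper.

```python
import re
from typing import Optional
from urllib.parse import urlsplit

# Assumption: an ID is a run of exactly six digits somewhere in the path.
ID_PATTERN = re.compile(r"(?<!\d)(\d{6})(?!\d)")


def extract_id(url: str) -> Optional[str]:
    """Return the last six-digit ID found in the URL's path, if any."""
    path = urlsplit(url).path
    matches = ID_PATTERN.findall(path)
    return matches[-1] if matches else None


# The original link and a restructured one resolve to the same ID.
print(extract_id("http://www.pathfinder.com/money/moneydaily/1998/981212.moneyonline.html"))
print(extract_id("http://www.pathfinder.com/a/981212"))
```

A server built this way looks the document up by the extracted ID and ignores, or redirects away, whatever surrounds it.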

The text below that example says that the .html ought not to be there. That's clearly not intended to be part of what that example is demonstrating, but I guess it's just there because they were going for real world examples.

The arbitrary path hierarchy is not so bad. Better than every URI just being https://domainname.com/meaninglesshash. You can also stick a short prefix in front, like https://domainname.com/v1/money/1998/etc, so that all documents created after a reorg can use a different prefix. If your reorg is so severe that there's no way to keep access to old documents under their old URI, even if it has its own prefix, it seems unlikely they'll be made available in any other location. In that context you can imagine the article is imploring you "please don't delete access to old documents".

Your remaining objections, for host name and access, boil down to "don't use URIs at all, and don't bother to avoid changing them". As I type this comment I'm starting to realise that was your whole point, but it was a bit buried alongside minor objections to this particular example. It's also perhaps a bit of an extreme point of view. Referencing a git hash alongside a URI is sensible, but on its own it's pretty useless, and many web pages won't have anything analogous.

I would say the most excusable part is the protocol, but of course that generally ends up being a 301, even though the URI has indeed changed.

Hostname, well perhaps if a company has been merged/sold.

Path/query is really down to information architecture and planning that early on can go a long way, e.g. contact, faq belonging in a /site subdirectory.

File extension doesn't really matter nowadays.

Main thing is there are no technical reasons for the change. I recently saw someone wanting to change the URLs of their entire site because they now use PHP instead of ASP. They could configure their web server to let PHP handle those pages and save the outside world a redirect and twice as many URLs to think about.

> - enable HTTPS

I really wish HTTPS hadn't changed the URL scheme so you could host both HTTPS and fallback HTTP under the same URL. However most HTTPS sites will redirect http://domain/(.*) to https://domain/$1 (or at least they should) so this doesn't need to break URLs.
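The http://domain/(.*) to https://domain/$1 rule can be sketched as a function that computes the `Location` value for a 301. `https_location` is a hypothetical name, and the sketch assumes default ports: an explicit port in the URL is kept as-is, which may not be what a real deployment wants.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def https_location(url: str) -> Optional[str]:
    """Return the HTTPS equivalent of an http:// URL, or None otherwise."""
    parts = urlsplit(url)
    if parts.scheme != "http":
        return None  # already HTTPS (or some other scheme): no redirect
    # Host, path, query and fragment are carried over unchanged.
    return urlunsplit(parts._replace(scheme="https"))


print(https_location("http://domain/money/1998/981212?ref=hn"))
# https://domain/money/1998/981212?ref=hn
```

Because everything after the scheme is preserved, every old http:// link keeps working without a per-page rule.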

> This is really a description of where to find a document ("locator" not "identifier").

This is excellent. I wish more people would make your distinction between URL and URI. URIs really are supposed to be IDs. When put in that parlance, it's hard to say that IDs should change willy-nilly on the web. That said, I think that does deprioritize a global hierarchy / taxonomy for a fundamentally graph-like data structure.

> If you want something that does not change, don't link to a location but link to content directly

I see the motivation for this, but I've personally found this to be just as problematic as blurring the distinction between URIs and URLs. Most "depth" and hierarchy that's in URLs is stuff that ideally would be in the domain part of the URL. For instance:

http://company.com/blog/2019/02/10-cool-tips-you-wouldnt-bel...

would really map to:

http://blog.company.com/2019/02/10-cool-tips-you-wouldnt-bel...

and the "blog" subdomain would be owned by a team. You could imagine "payments", "orders", or whatever combo of relevant subdomains (or sub-subdomains). In my experience this hierarchical federation within an organization is not only natural, it's inevitable: Conway's Law.

So I do very much believe that the hierarchy of content and data is possible without needing a flat keyspace of ids. Just off the top of my head, issues with the flat keyspace are things like ownership of namespaces, authorization, resource assignment, different types of formats/content for the same underlying resources etc. Hierarchies really do scale and there's reason for them.

That said, most sites (the effective 'www' part of the domain) are really materialized _views_ of the underlying structure of the site/org. The web is fundamentally built to do this mashup of different views. Having your "location" be considered a reference "view" to the underlying "identity" "data" would go a long way to fixing stuff like this.

ISBN numbers are notoriously doubly- and re-allocated.

DOIs and ISBNs are as much locations as URLs are.

Content-based URNs are the only option.