by karlicoss 1515 days ago
It's kind of tricky to do in the general case, e.g. even Hacker News keeps meaningful semantic information in the id= query parameter.

Because of that, it ultimately needs to be a site-specific database/algorithm, perhaps with a fallback to a default behaviour such as stripping the most common garbage (_encoding, usg, etc.). I suspect it's possible to use some sort of machine learning to guess the meaningful parts of the URL path/query/fragment, but even for that we need some human curation for the training set. I wish we could collaborate on a shared database/library for that. I have sketched some ideas/applications/prior art here: https://beepb00p.xyz/exobrain/projects/cannon.html
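A minimal sketch of that "site-specific rules with a default fallback" idea, in Python. The rule tables here are made up for illustration: SITE_KEEP and DEFAULT_DROP are hypothetical names, and the utm_* entries are my own addition to the _encoding/usg examples above. A real database would need the human curation mentioned above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical per-site rules: query parameters that carry meaning on a host.
SITE_KEEP = {
    "news.ycombinator.com": {"id"},
}

# Hypothetical default list of parameters that are usually tracking garbage.
DEFAULT_DROP = {"_encoding", "usg", "utm_source", "utm_medium", "utm_campaign"}


def normalise(url: str) -> str:
    """Apply the site-specific rule if one exists, else the default clean-up."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    pairs = parse_qsl(parts.query, keep_blank_values=True)

    keep = SITE_KEEP.get(host)
    if keep is not None:
        # Known site: keep only the parameters listed as meaningful.
        pairs = [(k, v) for k, v in pairs if k in keep]
    else:
        # Unknown site: only remove the commonly seen garbage.
        pairs = [(k, v) for k, v in pairs if k not in DEFAULT_DROP]

    # The path and fragment are left untouched in this sketch.
    return urlunsplit(
        (parts.scheme, host, parts.path, urlencode(pairs), parts.fragment)
    )
```

For a known host everything except the listed parameters is dropped; for an unknown host only the default garbage list is removed, so unrecognised parameters survive.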

I started thinking about it since I have a similar problem in Promnesia (https://github.com/karlicoss/promnesia#readme), a knowledge management tool I'm working on. Ideally I want to normalise URLs so they address the exact bit of information, and nothing more.

2 comments

It doesn't cover everyone, but you should know about rel="canonical" (https://datatracker.ietf.org/doc/html/rfc6596). For example, Amazon helpfully hints that this messy Kindle Paperwhite link (https://www.amazon.com/Kindle-Paperwhite-Signature-Essential...) is actually https://www.amazon.com/Kindle-Paperwhite-Signature-Essential....

Additionally, ClearURLs to the rescue! https://github.com/ClearURLs

Yeah, sadly, to get the canonical attribute, you need to fetch the URL first (which is slow and wasteful). Also, sometimes the canonical URL still differs between the desktop and mobile versions of the site, so it still has to be normalised after that.
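
For reference, a rough sketch of pulling rel="canonical" out of a page that has already been fetched, using only the Python standard library. CanonicalFinder and find_canonical are names I made up; fetching the page and the desktop/mobile normalisation mentioned above are deliberately left out.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalFinder(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel is a space-separated list of tokens, compared case-insensitively.
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels and attr.get("href"):
            self.canonical = attr["href"]


def find_canonical(html: str) -> Optional[str]:
    """Return the canonical URL declared in the HTML, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical
```

The result can be a relative URL, so a caller would probably want to resolve it against the page URL (e.g. with urllib.parse.urljoin) before comparing.
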
I don't even think you need a machine learning algorithm. I follow a simple process all the time to achieve this.

  load the page with the original URL
  for each part of the URL:
    remove that part of the URL
    load the page with the modified URL
    if the page rendered differently:
      put that part back
You'd need to incorporate an ad blocker, otherwise changing ads on each reload could screw it up. Of course, you'd also probably want to hard-code in logic for popular websites like Amazon to avoid wasting time with reloads.

And I know there are limitations to the concept. It won't work for pages which require authentication or any kind of session data. It won't work for pages which are intentionally dynamic. It won't work for sites which cannot be accessed by the service. It won't be useful for pages which simply have long URLs. But it probably covers many common use cases for a URL shortener, and you could always fall back on traditional shortening methods (redirecting) when it doesn't work.
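
The loop above could be sketched in Python roughly like this. To keep it short, "part of the URL" is narrowed to query parameters only, and "rendered differently" is approximated by comparing whatever a caller-supplied fetch function returns. Both are simplifying assumptions, and minimise_query and fetch are hypothetical names. In practice fetch would have to render the page with ads blocked, as noted above.

```python
from typing import Callable
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def minimise_query(url: str, fetch: Callable[[str], str]) -> str:
    """Drop every query parameter whose removal leaves the page unchanged.

    `fetch` is supplied by the caller and returns some comparable
    representation of the rendered page for a given URL.
    """
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    baseline = fetch(url)  # load the page with the original URL

    i = 0
    while i < len(pairs):
        # Remove one parameter and load the page with the modified URL.
        trial = pairs[:i] + pairs[i + 1:]
        candidate = urlunsplit(parts._replace(query=urlencode(trial)))
        if fetch(candidate) == baseline:
            pairs = trial  # same page: the parameter was garbage
        else:
            i += 1  # page changed: put the parameter back, move on

    return urlunsplit(parts._replace(query=urlencode(pairs)))
```

This costs one page load per parameter plus one for the baseline, which is why hard-coding rules for popular sites would still be worthwhile.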

Oh nice, I like it!

So it basically automates detecting the useful bits of a particular URL, but it's kind of time-consuming and flaky. It could be very helpful for populating the 'rules' database though, and then this database could be shared with other people so they don't have to scrape.
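
A small sketch of what populating such a rules database might look like, assuming each observation is an (original URL, minimised URL) pair produced by some probing step. The keep/drop layout and the JSON format are my own guesses, not an existing schema, and learn_rules and to_json are hypothetical names.

```python
import json
from urllib.parse import parse_qsl, urlsplit


def _param_names(url: str) -> set:
    """Names of the query parameters present in a URL."""
    return {k for k, _ in parse_qsl(urlsplit(url).query, keep_blank_values=True)}


def learn_rules(observations):
    """Turn (original, minimised) URL pairs into per-host keep/drop sets."""
    rules = {}
    for original, minimised in observations:
        host = urlsplit(original).netloc.lower()
        before, after = _param_names(original), _param_names(minimised)
        entry = rules.setdefault(host, {"keep": set(), "drop": set()})
        entry["keep"] |= after            # survived minimisation
        entry["drop"] |= before - after   # removed without changing the page
    # Be conservative: a parameter that mattered once is never dropped.
    for entry in rules.values():
        entry["drop"] -= entry["keep"]
    return rules


def to_json(rules) -> str:
    """Serialise the rules so they can be shared (sets become sorted lists)."""
    plain = {h: {k: sorted(v) for k, v in e.items()} for h, e in rules.items()}
    return json.dumps(plain, indent=2, sort_keys=True)
```
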

I guess when I said ML (or preferably some fuzzy algorithm/heuristic), I was referring to generalising the rules so they also work on sites that aren't in the rules database. If humans can detect garbage in a URL by looking at a few examples, the computer can too :)
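
One very simple heuristic along those lines (nothing like real ML, just counting): treat a parameter as garbage everywhere if it was dropped on several hosts and never needed on any. The input layout (host mapped to keep/drop sets), the threshold, and the name global_drop_candidates are all assumptions for the sake of the sketch.

```python
from collections import Counter


def global_drop_candidates(rules, min_hosts=2):
    """Guess parameters that are garbage even on sites with no rules yet.

    `rules` maps host -> {"keep": set_of_params, "drop": set_of_params}.
    A parameter qualifies if it was dropped on at least `min_hosts` hosts
    and was not meaningful on any host.
    """
    dropped_on = Counter()
    kept_somewhere = set()
    for entry in rules.values():
        dropped_on.update(entry["drop"])  # one count per host
        kept_somewhere |= set(entry["keep"])
    return {
        param
        for param, hosts in dropped_on.items()
        if hosts >= min_hosts and param not in kept_somewhere
    }
```

A name like id would never qualify, since it is meaningful on at least one site, while tracking-style parameters that keep turning up as removable would.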