by karlicoss
1515 days ago
It's kind of tricky to do in the general case: even Hacker News keeps meaningful semantic information in the id= query parameter. Because of that, it ultimately needs to be a site-specific database/algorithm, perhaps with a fallback to default behaviour such as simply cleaning up the most common garbage (_encoding, usg, etc.). I suspect it's possible to use some sort of machine learning to guess the meaningful parts of the URL path/query/fragment, but even then we'd need some human curation for the training set.

I wish we could collaborate on a shared database/library for that; I've sketched some ideas/applications/prior art here: https://beepb00p.xyz/exobrain/projects/cannon.html

I started thinking about it because I have a similar problem in Promnesia (https://github.com/karlicoss/promnesia#readme), a knowledge management tool I'm working on. Ideally I want to normalise URLs so they address the exact bit of information, and nothing more.
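To make the "site-specific rules plus a default fallback" idea concrete, here is a rough sketch using only Python's standard library. The rule tables and the `normalise` function are hypothetical names made up for illustration; they are not taken from Promnesia or any existing library, and the parameter lists are assumptions rather than a curated database.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed fallback list of query parameters that rarely carry meaning.
# "_encoding" and "usg" come from the comment above; the utm_* entries
# are an illustrative addition.
DEFAULT_GARBAGE = {"_encoding", "usg", "utm_source", "utm_medium", "utm_campaign"}

# Assumed per-site rules: for a known host, the set of query parameters
# that are meaningful and must be kept. Everything else is dropped.
SITE_KEEP = {
    "news.ycombinator.com": {"id"},
}


def normalise(url: str) -> str:
    """Drop query parameters that do not identify the content."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)

    keep = SITE_KEEP.get(parts.hostname or "")
    if keep is not None:
        # Site-specific rule: keep only the known-meaningful parameters.
        query = [(k, v) for k, v in query if k in keep]
    else:
        # Fallback: strip only the most common garbage.
        query = [(k, v) for k, v in query if k not in DEFAULT_GARBAGE]

    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(query), parts.fragment)
    )


print(normalise("https://news.ycombinator.com/item?id=123&utm_source=x"))
print(normalise("https://example.com/a?usg=abc&q=hello#frag"))
```

The allow-list for known sites versus deny-list for unknown ones is just one possible design; a real shared database would presumably also need rules for paths and fragments, which this sketch leaves untouched.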
|
Additionally, ClearURLs to the rescue! https://github.com/ClearURLs