
by CivBase 1515 days ago

I don't even think you need a machine learning algorithm. I follow a simple process all the time to achieve this.

  load the page with the original URL
  for each part of the URL:
    remove that part of the URL
    load the page with the modified URL
    if the page rendered differently:
      put that part back
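
A rough Python sketch of that loop, in case it helps. It assumes "part" means a query parameter (path segments and fragments are left alone), and it takes the page comparison as a function: `render` is a hypothetical callable that loads a URL and returns something comparable, such as a normalized snapshot of the page. Neither assumption comes from the process above, which leaves both open.

```python
from typing import Callable
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def minimize_url(url: str, render: Callable[[str], str]) -> str:
    """Drop query parameters whose removal leaves the rendered page unchanged.

    `render` is a stand-in for "load the page and return a comparable
    snapshot"; how that snapshot is produced is up to the caller.
    """
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    baseline = render(url)  # load the page with the original URL

    def build(indices):
        query = urlencode([params[i] for i in indices])
        return urlunsplit(parts._replace(query=query))

    kept = list(range(len(params)))
    for i in range(len(params)):
        trial = [j for j in kept if j != i]    # remove that part
        if render(build(trial)) == baseline:   # page rendered the same
            kept = trial                       # leave it out
        # otherwise the part stays in `kept`, i.e. it is "put back"
    return build(kept)
```

This costs one page load per parameter plus one for the baseline, which is why a cache of known results per site would save a lot of time.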

You'd need to incorporate an ad blocker, otherwise changing ads on each reload could screw it up. Of course, you'd also probably want to hard-code in logic for popular websites like Amazon to avoid wasting time with reloads.

And I know there are limitations to the concept. It won't work for pages that require authentication or any kind of session data. It won't work for pages that are intentionally dynamic. It won't work for sites that cannot be accessed by the service. It won't be useful for pages that simply have long URLs. But it probably covers many of the common use cases for a URL shortener, and you could always fall back on traditional shortening methods (redirecting) when it doesn't work.

1 comment

karlicoss 1515 days ago

Oh nice, I like it!

So it basically automates detecting useful bits for a particular URL, but it's kind of time consuming and flaky. It could be very helpful to populate the 'rules' database though, and then this database could be shared with other people so they don't have to scrape.
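
To make the shared database idea concrete, here is a minimal sketch of how a lookup might work. The format is an assumption: `rules` is a hypothetical mapping from hostname to the set of query parameters known to matter, and the host names are placeholders.

```python
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical rules database: host -> query parameters worth keeping.
RULES = {
    "shop.example": {"id"},
    "video.example": {"v", "t"},
}


def apply_rules(url: str, rules: dict) -> Optional[str]:
    """Strip every query parameter the rule for this host does not list.

    Returns None when no rule exists, so the caller can fall back to
    the slower reload-and-compare approach.
    """
    parts = urlsplit(url)
    keep = rules.get(parts.hostname)
    if keep is None:
        return None
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in keep
    ]
    return urlunsplit(parts._replace(query=urlencode(params)))
```

With a rule in place no page loads are needed at all, so scraping only has to happen once per site, by whoever contributes the rule.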

I guess when I said ML (or preferably some fuzzy algorithm/heuristic), I was referring to generalizing the rules so they also work on sites that aren't in the rules database. If humans can detect garbage in a URL by looking at a few examples, the computer can too :)
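
One very simple heuristic along those lines, sketched under heavy assumptions: treat a parameter name as garbage everywhere if it turned out to be droppable on several different hosts and was never required on any. The `observations` tuples and the `min_hosts` threshold are invented for illustration. This is a counting rule, not ML.

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple


def generalize(
    observations: Iterable[Tuple[str, str, bool]], min_hosts: int = 2
) -> Set[str]:
    """Return parameter names that look droppable on any site.

    Each observation is (host, parameter name, was it droppable there).
    A name qualifies if it was droppable on at least `min_hosts`
    distinct hosts and was never needed on any host.
    """
    dropped = defaultdict(set)
    needed = defaultdict(set)
    for host, name, droppable in observations:
        (dropped if droppable else needed)[name].add(host)
    return {
        name
        for name, hosts in dropped.items()
        if len(hosts) >= min_hosts and name not in needed
    }
```

A name like `id` that matters on even one site is excluded, so the rule errs on the side of keeping parameters.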
