The code is open source, in case you are interested in reading it!

https://github.com/capjamesg/nanosearch/blob/main/nanosearch...

Here is how it works:

  1. You provide a sitemap.
  2. All URLs in the sitemap are downloaded.
  3. The HTML from each URL is read, extracting the page title, description, and contents.
  4. The contents are processed using a search algorithm. This tool supports TF/IDF and BM25, two commonly used search algorithms. I use Python packages that implement these, since many people have already implemented them reliably.
  5. A link graph is calculated that tracks all links between all pages.
  6. When you run a search, the algorithm you chose (BM25 or TF/IDF) runs to find related documents. This is a keyword search. Then, results are weighted by the number of links pointing to each page. This weighting is useful if a site talks a lot about topics with the same keywords; by using links as a ranking factor, posts that are more connected to others are elevated in search. Google pioneered the idea that links are "votes" on the relevance of content (although this tool doesn't use PageRank, as Google does).
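
To make steps 4 to 6 concrete, here is a rough sketch of keyword scoring followed by a link-count boost. It is not the tool's actual code: the tool uses existing Python packages for BM25 and TF/IDF, whereas this sketch hand-rolls a small BM25 so it stays self-contained. The `PAGES` data, the function names, and the `1 + log(1 + inbound links)` boost are all assumptions made for illustration; the description above only says that results are weighted by the number of links to a page, not how.

```python
import math
from collections import Counter

# Hypothetical pages standing in for downloaded and parsed HTML (steps 1-3).
PAGES = {
    "/bm25-intro": {"text": "notes on search ranking", "links": ["/link-graphs"]},
    "/link-graphs": {"text": "notes on search ranking", "links": []},
    "/about": {"text": "about this site", "links": ["/link-graphs"]},
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def inbound_link_counts(pages):
    """Step 5: count how many other known pages link to each page."""
    counts = Counter()
    for url, page in pages.items():
        for target in set(page["links"]):
            if target in pages and target != url:
                counts[target] += 1
    return counts


def bm25_scores(query, pages, k1=1.5, b=0.75):
    """Step 4: score every page against the query with a basic BM25."""
    docs = {url: tokenize(page["text"]) for url, page in pages.items()}
    n = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n
    # Document frequency: in how many pages each term appears.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    scores = {}
    for url, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[url] = score
    return scores


def search(query, pages):
    """Step 6: keyword search, then boost by inbound links (assumed formula)."""
    inbound = inbound_link_counts(pages)
    keyword = bm25_scores(query, pages)
    ranked = {
        url: score * (1 + math.log1p(inbound[url]))
        for url, score in keyword.items()
        if score > 0  # drop pages that match no query term
    }
    return sorted(ranked, key=ranked.get, reverse=True)


if __name__ == "__main__":
    print(search("ranking", PAGES))
```

In this toy data the two matching pages have identical text, so their keyword scores tie and the inbound links decide the order. That is the situation the link weighting is meant to help with: many pages on a site sharing the same keywords.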