I was trying to learn more about the ranking algorithm that Alexandria uses, and I was a bit confused by its documentation on GitHub. Am I correct that it uses "Harmonic Centrality" (http://vigna.di.unimi.it/ftp/papers/AxiomsForCentrality.pdf), at least for part of the algorithm?
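For readers unfamiliar with the term: the linked paper defines the harmonic centrality of a node x as the sum of 1/d(y, x) over all other nodes y, with unreachable nodes contributing 0. Below is a minimal sketch of that definition using a breadth-first search over incoming links. The function name and the graph representation are hypothetical, and this is not taken from Alexandria's code, which may compute or normalize the value differently.

```cpp
#include <queue>
#include <vector>

// Hypothetical helper: harmonic centrality of `target` in a directed graph.
// incoming[v] lists the nodes that link to v, so a BFS from `target` over
// incoming edges yields d(y, target) for every node y that can reach it.
double harmonic_centrality(const std::vector<std::vector<int>> &incoming, int target) {
    std::vector<int> dist(incoming.size(), -1);  // -1 marks "not visited yet"
    std::queue<int> pending;
    dist[target] = 0;
    pending.push(target);

    double sum = 0.0;
    while (!pending.empty()) {
        const int v = pending.front();
        pending.pop();
        if (v != target) {
            sum += 1.0 / dist[v];  // nodes that never get visited contribute 0
        }
        for (int u : incoming[v]) {
            if (dist[u] == -1) {
                dist[u] = dist[v] + 1;
                pending.push(u);
            }
        }
    }
    return sum;
}
```

For a chain 0 -> 1 -> 2, node 2 is reached from node 1 at distance 1 and from node 0 at distance 2, so its harmonic centrality is 1 + 1/2.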
Yes, our documentation is probably pretty confusing. It works like this: the base score for all URLs belonging to a specific domain is that domain's harmonic centrality (HC).
Then we have two indexes, one with URLs and one with links (we index the link text).
We first search the links, then the URLs. We then update the score of each URL based on the matching links with this formula:
domain_score = expm1(5 * link.m_score) + 0.1;
url_score = expm1(10 * link.m_score) + 0.1;
We then add domain_score and url_score to url.m_score, where link.m_score is the HC of the link's source domain.
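As a rough illustration of the update described above, here is a minimal C++ sketch. Only the two formulas and the m_score field name come from the description; the struct names, the apply_link function, and the assumption that both terms are added for every matching link are hypothetical and may not match the real code.

```cpp
#include <cmath>

// Hypothetical record types; only the m_score field name comes from the thread.
struct link_record {
    double m_score;  // harmonic centrality of the link's source domain
};

struct url_record {
    double m_score;  // starts out as the harmonic centrality of the URL's domain
};

// Boost a URL's score for one matching link (assumed: both terms always apply).
void apply_link(url_record &url, const link_record &link) {
    const double domain_score = std::expm1(5.0 * link.m_score) + 0.1;
    const double url_score = std::expm1(10.0 * link.m_score) + 0.1;
    url.m_score += domain_score + url_score;
}
```

Because expm1(0) is 0, a link from a domain with zero HC still adds 0.2 to the URL's score, while the exponential makes links from high-HC domains count much more.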
The main scoring function seems to be index_builder<data_record>::calculate_score_for_record(), at line 296 of https://github.com/alexandria-org/alexandria/blob/main/src/i..., and it mentions support for BM25 (Spärck Jones, Walker and Robertson, 1976) and TF-IDF (Spärck Jones, 1972) term weighting, pointing to the respective Wikipedia pages.
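For context, the two term-weighting schemes mentioned can be sketched in their common textbook forms. This is not Alexandria's implementation: the function names, parameter names, and the defaults k1 = 1.2 and b = 0.75 are assumptions chosen for illustration.

```cpp
#include <cmath>

// Textbook TF-IDF weight of one term in one document.
// tf: occurrences of the term in the document; num_docs: collection size;
// docs_with_term: number of documents containing the term (must be > 0).
double tfidf(double tf, double num_docs, double docs_with_term) {
    return tf * std::log(num_docs / docs_with_term);
}

// Textbook Okapi BM25 weight of one term in one document.
// k1 controls term-frequency saturation, b controls length normalization.
double bm25(double tf, double doc_len, double avg_doc_len,
            double num_docs, double docs_with_term,
            double k1 = 1.2, double b = 0.75) {
    const double idf =
        std::log((num_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1.0);
    const double length_norm = 1.0 - b + b * doc_len / avg_doc_len;
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm);
}
```

Unlike TF-IDF, which grows linearly with the term frequency, the BM25 weight saturates: it never exceeds idf * (k1 + 1) no matter how often the term repeats.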