Hacker News
by softwaredoug 939 days ago
Right now it’s the typical inverted index:

term -> list of doc ids

And the main purpose of the data structure is to recover term frequencies and document frequencies. We also store positional information to allow phrase matching.
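To make that concrete, here is a rough sketch of a positional inverted index in plain Python. It is not the actual implementation: the whitespace tokenizer and the names `build_index`, `term_freq`, `doc_freq`, and `phrase_match` are made up for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to {doc_id: [positions]}.

    Assumes naive lowercase whitespace tokenization (illustrative only).
    """
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def term_freq(index, term, doc_id):
    """How many times `term` occurs in document `doc_id`."""
    return len(index.get(term, {}).get(doc_id, []))


def doc_freq(index, term):
    """How many documents contain `term` at least once."""
    return len(index.get(term, {}))


def phrase_match(index, phrase):
    """Return sorted doc ids where the phrase's terms occur at adjacent positions."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Only documents containing every term can match the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        starts = index[terms[0]][doc_id]
        # A match needs term i at position start + i for every term in the phrase.
        if any(all(p + i in index[t][doc_id] for i, t in enumerate(terms))
               for p in starts):
            hits.append(doc_id)
    return hits


docs = ["the quick brown fox", "the brown quick fox", "quick quick fox"]
index = build_index(docs)
print(term_freq(index, "quick", 2), doc_freq(index, "brown"))
print(phrase_match(index, "quick brown"))
```

The positions are what make the phrase check possible: with doc ids alone you could only tell that both terms appear somewhere in the document.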

BM25 of course is just one such way of using these stats. But you can also get raw termfreqs and docfreqs of matching terms and do whatever you want with them mathematically :).

The BM25 here tries to align with Lucene’s internal BM25 calculation.
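For reference, a Lucene-style BM25 for a single term can be sketched from those raw stats roughly as below. This assumes the formula used by recent Lucene versions (idf as ln(1 + (N - df + 0.5) / (df + 0.5)), no (k1 + 1) factor in the numerator, defaults k1 = 1.2 and b = 0.75) and ignores Lucene's lossy encoding of document length, so scores may differ slightly from a real index. The function name `bm25_term_score` and its parameters are hypothetical.

```python
import math


def bm25_term_score(tf, df, num_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Score one term in one document, in the style of Lucene's BM25.

    tf: term frequency in the document
    df: number of documents containing the term
    Assumption: exact document lengths (Lucene stores a lossy approximation).
    """
    if tf == 0:
        return 0.0
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    # Length normalization: longer-than-average documents are penalized.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf / (tf + norm)


# Rare terms (low df) and repeated terms (high tf) both raise the score.
rare = bm25_term_score(tf=2, df=1, num_docs=100, doc_len=10, avg_doc_len=10)
common = bm25_term_score(tf=2, df=50, num_docs=100, doc_len=10, avg_doc_len=10)
print(rare > common)
```

Since the inputs are just termfreqs, docfreqs, and lengths, swapping in a different formula is a matter of replacing this one function.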