

by jankovicsandras 598 days ago

This is a good point and was a difficult design decision. The reasons for changing the API are:

- easier to use with an untokenized corpus and untokenized questions

- to fix issues with tokenizing (e.g. https://github.com/dorianbrown/rank_bm25/issues/38); also, rank_bm25 provides no default tokenizer, and a naive split on whitespace is a poor choice

- to simplify the code considerably (far fewer SLOC)

- to point out the similarities of the algorithms for educational purposes / further development

In practice, the differences are minimal (see Example 3: comparison with rank_bm25).
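
To illustrate the tokenizing point, here is a minimal sketch of why a plain split on whitespace tends to miss matches. It is not the code of either library: `whitespace_tokenize` and `simple_tokenize` are hypothetical helpers written only for this example, and the regex-based tokenizer is just one possible assumption about what a reasonable default could look like.

```python
import re

def whitespace_tokenize(text):
    # Naive approach: punctuation and case stay attached to the tokens.
    return text.split()

def simple_tokenize(text):
    # Hypothetical default: lowercase, then keep runs of word characters.
    return re.findall(r"\w+", text.lower())

doc = "Ranking, the core of BM25."
query = "ranking"

# The naive tokens keep "Ranking," and "BM25." so the query term is not found.
print(whitespace_tokenize(doc))
print(query in whitespace_tokenize(doc))

# The normalized tokens contain "ranking" and "bm25".
print(simple_tokenize(doc))
print(query in simple_tokenize(doc))
```

With the naive split, the query term never matches the document term, so a term-based scorer such as BM25 would see no overlap at all for this pair.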