by huahaiy
1670 days ago
There's no guarantee in a pure vector space model that this is the case. Your understanding is way off. This is a caveat that Lucene prominently puts on its web page. https://lucene.apache.org/core/3_5_0/scoring.html "Lucene scoring is the heart of why we all love Lucene. It is blazingly fast and it hides almost all of the complexity from the user. In a nutshell, it works. At least, that is, until it doesn't work, or doesn't work as one would expect it to work. Then we are left digging into Lucene internals or asking for help on java-user@lucene.apache.org to figure out why a document with five of our query terms scores lower than a different document with only one of the query terms." See? The Lucene people know about it, but they just do not think it is a problem. But I do. T-Wand, though it also uses the vector space model, makes sure it is the case.
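To make the caveat in that quote concrete, here is a minimal sketch of plain TF-IDF cosine scoring in Python. It is a toy, not Lucene's actual scoring formula and not T-Wand; the corpus, the document names, and the smoothed idf are all assumptions made up for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus; names and texts are invented for illustration.
docs = {
    "tweet": "skywalker",
    "book": ("luke walked for days across the desert then much later "
             "a man named skywalker appeared near the old town"),
    "other": "the weather today is fine",
}
query = "luke skywalker"


def tokenize(text):
    return text.lower().split()


# Document frequency: in how many documents each term appears.
N = len(docs)
df = Counter()
for text in docs.values():
    df.update(set(tokenize(text)))


def idf(term):
    # Smoothed idf (an assumption); terms absent from the corpus get weight 0.
    return math.log(1 + N / df[term]) if df[term] else 0.0


def vector(text):
    # Sparse TF-IDF vector as a dict: term -> tf * idf.
    tf = Counter(tokenize(text))
    return {t: c * idf(t) for t, c in tf.items()}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


q_vec = vector(query)
q_terms = set(tokenize(query))

scores = {name: cosine(q_vec, vector(text)) for name, text in docs.items()}
matches = {name: len(q_terms & set(tokenize(text))) for name, text in docs.items()}

for name in sorted(scores, key=scores.get, reverse=True):
    print(f"{name}: matched {matches[name]} query term(s), score {scores[name]:.3f}")
```

With this toy data the one-term "tweet" ranks above the two-term "book", because cosine normalization penalizes the book for all of its non-query words. Nothing in the model itself forces more matched terms to mean a higher score.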
Those settings aside, there are arguably cases where fewer term matches are more relevant.
Take the search “Luke Skywalker”.
A tweet that mentions the term “Skywalker” once, for example, has much higher “Star Wars” aboutness than a book that uses “Skywalker” on one page and “Luke” pages apart.
That is, the information density of Star Wars content is far higher in the tweet, whereas a long enough book eventually uses most English words somewhere.