by softwaredoug
1677 days ago
On second read-through, I think the author is maybe(?) describing the assumptions behind WAND and the relevance algos that benefit from it, rather than making some overarching statement about relevance per se? But it's mixed in with statements about relevance and what Lucene does that are mostly incorrect. For example, he says:

> However, as you can see, this vector space model does not explicitly require a higher ranking document to contain more query terms than a lower ranking one.

Well, the way you get a higher similarity in a vector-space model is by matching more terms. The caveat is that IDF and field length also make you consider a term's specificity. So if you search for 'luke skywalker', you care more about the 'skywalker' match than the 'luke' match. But a match on BOTH 'luke' and 'skywalker' would score higher (field lengths being constant).
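
To make the 'luke skywalker' point concrete, here is a rough sketch of a bare-bones TF-IDF sum. It is not Lucene's actual scoring formula; the toy corpus, the smoothed `idf`, and the `score` helper are all made up for illustration. Every document has the same length, so field-length normalization drops out.

```python
import math

# Toy corpus (hypothetical). Every document has exactly 3 tokens, so
# field length is constant and can be ignored in the comparison.
docs = {
    "both": ["luke", "skywalker", "jedi"],
    "only_skywalker": ["anakin", "skywalker", "jedi"],
    "only_luke": ["luke", "cage", "hero"],
    "luke_again": ["luke", "perry", "actor"],
}

def idf(term):
    # Smoothed inverse document frequency: rarer terms get a larger weight.
    df = sum(term in tokens for tokens in docs.values())
    return math.log(1 + len(docs) / (1 + df))

def score(query, tokens):
    # Sum of tf * idf over the query terms; absent terms contribute 0.
    return sum(tokens.count(term) * idf(term) for term in query)

query = ["luke", "skywalker"]
scores = {name: score(query, tokens) for name, tokens in docs.items()}

for name, value in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.3f}")
```

In this corpus 'skywalker' appears in fewer documents than 'luke', so a 'skywalker'-only match beats a 'luke'-only match, and the document matching both terms comes out on top.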
|
https://lucene.apache.org/core/3_5_0/scoring.html
"Lucene scoring is the heart of why we all love Lucene. It is blazingly fast and it hides almost all of the complexity from the user. In a nutshell, it works. At least, that is, until it doesn't work, or doesn't work as one would expect it to work. Then we are left digging into Lucene internals or asking for help on java-user@lucene.apache.org to figure out why a document with five of our query terms scores lower than a different document with only one of the query terms."
See? Lucene people know about it, but they just do not think it is a problem.
But I do.
T-Wand, although it also uses the vector space model, makes sure that a higher-ranking document does contain more query terms than a lower-ranking one.
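
The situation in the Lucene docs quote can be reproduced with toy numbers. The sketch below is only an illustration: the corpus statistics are invented, `score` is a simplified sqrt(tf) * idf sum that ignores field length, and the second ordering (sort by number of matched query terms first, then by score) is a hypothetical stand-in for the property being argued for, not T-Wand's actual algorithm.

```python
import math

# Made-up corpus statistics for illustration only.
NUM_DOCS = 1000
DOC_FREQ = {"the": 900, "best": 400, "new": 500,
            "star": 300, "wars": 200, "skywalker": 5}

def idf(term):
    # Smoothed inverse document frequency.
    return math.log(1 + NUM_DOCS / (1 + DOC_FREQ[term]))

def score(query, term_counts):
    # Simplified tf-idf: sqrt(tf) * idf summed over query terms.
    return sum(math.sqrt(term_counts.get(t, 0)) * idf(t) for t in query)

def matched(query, term_counts):
    # Number of distinct query terms the document contains.
    return sum(1 for t in query if term_counts.get(t, 0) > 0)

query = ["the", "best", "new", "star", "wars", "skywalker"]
docs = {
    "five_common_terms": {"the": 1, "best": 1, "new": 1, "star": 1, "wars": 1},
    "one_rare_term": {"skywalker": 4},
}

# Plain score ordering: the single rare, repeated term wins.
by_score = sorted(docs, key=lambda d: score(query, docs[d]), reverse=True)

# Hypothetical ordering: matched-term count first, score as tie-breaker.
by_match_count = sorted(
    docs,
    key=lambda d: (matched(query, docs[d]), score(query, docs[d])),
    reverse=True,
)

print("by score:      ", by_score)
print("by match count:", by_match_count)
```

With these numbers the one-term document outscores the five-term document under the plain sum, while the match-count-first ordering puts the five-term document on top.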