I could not continue reading after the following “declaration”… The author should take a look at the Wikipedia page for TF-IDF.

> As someone who has a Ph.D. in Human-computer Interaction ;-), I feel like I am entitled to define a condition of "good" in relevance here. I hereby declare that:
>> A good top-K algorithm should rank a document containing more user query terms higher than a document containing less number of user query terms.
> This makes perfect sense. Right?

Also, “most search engines” don’t use the vector space model as the only way to rank results; PageRank is one example of another signal.

Edit: in some search scenarios, finding the documents with the most query terms makes sense, but Lucene can also rank using this metric.

Still, I applaud the author's effort in digging into the research literature.

Search relevance is very hard, and standard off-the-shelf metrics like TF-IDF and PageRank are often not enough. Good search usually requires a deep understanding of the specific subject domain and hand-tuning lots of signals, many of which aren't even strictly based on search terms (e.g., previously purchased products on a store's website, geographic location, trending results).
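To make the TF-IDF point concrete, here is a minimal sketch of how "more query terms matched" and a TF-IDF score can disagree. Everything in it is an assumption for illustration: the toy corpus, the query, the helper names `matched_terms` and `tf_idf_score`, and the smoothed idf formula (one common variant). It is not Lucene's scoring formula.

```python
import math
from collections import Counter

def matched_terms(query, doc):
    """Number of distinct query terms that appear in the document."""
    return len(set(query) & set(doc))

def tf_idf_score(query, doc, corpus):
    """Sum of raw term frequency times a smoothed idf over the query terms.

    The idf variant, log((n + 1) / (df + 1)) + 1, is an assumption chosen
    for this sketch; real engines use their own formulas and normalization.
    """
    counts = Counter(doc)
    n = len(corpus)
    score = 0.0
    for term in set(query):
        df = sum(1 for d in corpus if term in d)  # document frequency
        idf = math.log((n + 1) / (df + 1)) + 1.0
        score += counts[term] * idf
    return score

# Hypothetical toy corpus: doc 0 repeats one rare query term,
# docs 1-3 each contain two common query terms once.
corpus = [
    "lucene lucene lucene internals".split(),
    "fast search for everyone".split(),
    "fast search tips".split(),
    "search fast food".split(),
]
query = "fast lucene search".split()

doc_ids = range(len(corpus))
by_terms = sorted(doc_ids, key=lambda i: matched_terms(query, corpus[i]), reverse=True)
by_tfidf = sorted(doc_ids, key=lambda i: tf_idf_score(query, corpus[i], corpus), reverse=True)

print("ranked by matched query terms:", by_terms)
print("ranked by TF-IDF score:      ", by_tfidf)
```

In this toy setup, doc 0 matches only one query term but comes first under TF-IDF because that term is rare and repeated, while the matched-term ranking puts it last. Neither ordering is "correct" on its own; which one is better depends on the domain.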
Relevance is highly subjective and domain-specific, and it requires a great deal of measurement and testing as well as many different ranking signals. Lucene is a toolbox for crafting many of these signals.