by softwaredoug 1670 days ago
No, there’s no guarantee. There is a very strong bias, but no absolute guarantee. Most people use AND queries by default or set a high enough “min should match”.
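To illustrate what those two settings do, here is a minimal toy sketch in Python. It is not Lucene's or any engine's actual implementation; the `matches` helper and the sample documents are hypothetical and exist only for illustration.

```python
def matches(query_terms, doc_terms, min_should_match):
    """True if the document contains at least `min_should_match`
    distinct query terms (hypothetical helper, not a library API)."""
    present = sum(1 for t in set(query_terms) if t in doc_terms)
    return present >= min_should_match


query = ["luke", "skywalker"]
docs = {
    "both": {"luke", "skywalker", "jedi"},
    "one": {"skywalker", "tweet"},
}

# OR semantics: a single matching term is enough to be a candidate.
or_hits = [name for name, terms in docs.items() if matches(query, terms, 1)]

# AND semantics: every query term must be present.
and_hits = [name for name, terms in docs.items()
            if matches(query, terms, len(query))]

print(or_hits, and_hits)
```

With AND semantics (or a min should match equal to the number of query terms), a document with fewer term matches never becomes a candidate at all, so the ranking question only arises under looser settings.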

Those settings aside, arguably there are cases where fewer term matches could be more relevant.

For the search “Luke skywalker”

A tweet, for example, that mentions the term “Skywalker” once has much higher “Star Wars” aboutness than a book that uses “Skywalker” on one page and “Luke” pages apart.

That is, the information density of Star Wars-type content is far higher in the tweet, whereas a book eventually ends up using most English words somewhere.


You either have a guarantee or you don't; that's what the word "guarantee" means. Don't mince words.

You had a wrong understanding. Now you are corrected. Let's move on.

Nobody said anything about "fewer term matches could not be more relevant". You are just making up straw men here. That's not the discussion we are having.

What I said is this: from a user's point of view, it's not good for a document containing fewer query terms to rank higher. This is a fact that even Lucene acknowledges (at least as of version 3.5.0). You have nothing to counter this fact.

But you’re arguing that ranking by such a guarantee is ALWAYS most relevant, when that’s not always the case.

There’s been extensive research justification behind the vector-space model. BM25 is the 25th iteration of a model, and well-tuned BM25 holds the highest non-neural performance on many tasks, including question answering[1]. Research has long found that factors other than total term matches matter, such as IDF[2] and field length[3].

Have you benchmarked your relevance assumptions similarly? If so I’d love to see them and learn more!

[1] https://www.elastic.co/blog/improving-search-relevance-with-...

[2] https://www.researchgate.net/publication/238123710_Understan...

[3] http://sifaka.cs.uiuc.edu/course/410s12/mir.pdf
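To make the IDF and field-length point concrete, here is a toy sketch of a common BM25 formulation, applied to the tweet-versus-book case from earlier in the thread. Everything in it is an assumption for illustration: the `bm25_score` function, the five-document corpus, the filler tokens, and the commonly used values k1=1.2 and b=0.75. It is not a benchmark and not any engine's actual code.

```python
import math


def bm25_score(query, doc, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a query with a common BM25 form.
    Hypothetical helper; corpus is a list of token lists."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        # Rarer terms get a higher IDF weight.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        tf = doc.count(term)
        # Longer-than-average documents are penalized.
        length_norm = 1 - b + b * len(doc) / avgdl
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score


query = ["luke", "skywalker"]

# A short tweet that matches only one of the two query terms.
tweet = ["skywalker", "saga", "rewatch"]
# A long "book" that matches both terms once, far apart (made-up filler).
book = ["luke"] + ["filler"] * 999 + ["skywalker"] + ["filler"] * 1000
# A few short documents where "luke" is just a common first name.
others = [["luke", "went", "home"],
          ["luke", "called", "today"],
          ["ask", "luke", "later"]]

corpus = [tweet, book] + others
scores = {
    "tweet": bm25_score(query, tweet, corpus),
    "book": bm25_score(query, book, corpus),
}
print(scores)
```

In this made-up corpus the tweet scores higher than the book even though it matches only one query term: "skywalker" is rarer than "luke", and the book's length dilutes each of its single matches. Whether that ordering is what a user wants is exactly the question being argued here.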

You are making up a straw man again. Did I make that claim?

I was simply motivating my work by pointing out a problem that the current generation of search engines does not address.

What "relevance" means is, of course, a context-sensitive question.

You talk as if BM25 is the gold standard when it is not.

The research on this is all over the place. I just read an article that says that BM25 is way worse than alternative language models. You don't have to look far; for example, this one, which talks about WAND:

https://dl.acm.org/doi/10.1145/2537734.2537744

You know why I quit academia? It's useless arguments and virtue signaling like this.

I'd rather go out and build a damn thing that people like to use.

I merely pointed out that there's a problem, and I have a solution. I did not claim that my problem and my solution solve all problems. Isn't this obvious?

In my problem, my solution beats Lucene. It's as simple as that.

So if Lucene wants to be this infinitely configurable search library, it would be advisable to offer my solution as an option, or offer a better one that does something similar.

So far I have not seen any takers, only excuses.

The title of the section quoted is

"Better Relevance"

and I am simply saying, arguably, you can't make that claim without evidence.

(That said, I do find much to like about the article and the WAND / T-WAND explanation. I'm only pointing out you can't claim 'better relevance' without a benchmark to go with it.)