by softwaredoug 1670 days ago
As author of Relevant Search and contributor to AI-Powered Search, I endorse this :)

Relevance is really subjective and domain specific, and it requires an intense amount of measurement and testing and many different ranking signals. Lucene is a toolbox for crafting many of these signals.


On second read-thru, I think the author is maybe(?) describing the assumptions behind WAND and the relevance algos that benefit from it? Maybe not some overarching statement about relevance per se? But it's mixed in with statements about relevance / what Lucene does that are mostly incorrect...

For example, he says

> However, as you can see, this vector space model does not explicitly require a higher ranking document to contain more query terms than a lower ranking one.

Well, the way you get a higher similarity in a vector-space model is by matching more terms. The caveat is that IDF and field length make you also consider a term's specificity. So if you search for 'luke skywalker', you care more about the 'skywalker' match than the 'luke' match. But a match on BOTH 'luke' and 'skywalker' would score higher (field lengths being constant).
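To make that concrete, here's a minimal sketch of a TF-IDF style sum over a toy corpus. The documents, the smoothed IDF formula, and the function names are all assumptions for illustration; this is not Lucene's actual scoring code. Every document has the same length, so field length drops out of the comparison.

```python
import math

# Toy corpus (hypothetical). All documents have three tokens, so field-length
# normalization would not change the ordering.
docs = {
    "d1": ["luke", "skywalker", "jedi"],
    "d2": ["luke", "cage", "hero"],
    "d3": ["skywalker", "ranch", "studio"],
    "d4": ["luke", "warm", "water"],
}

def idf(term):
    # Smoothed inverse document frequency: rarer terms get a larger weight.
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(1 + len(docs) / df) if df else 0.0

def score(query, tokens):
    # Sum of term frequency times IDF over the query terms.
    return sum(tokens.count(term) * idf(term) for term in query)

query = ["luke", "skywalker"]
ranking = sorted(docs, key=lambda d: score(query, docs[d]), reverse=True)
print(ranking[:2])
```

Here 'skywalker' appears in fewer documents than 'luke', so it carries more weight, and the document matching both terms comes out on top, followed by the 'skywalker'-only one.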

There's no guarantee in a pure vector space model that this is the case. Your understanding is way off. This is a caveat that Lucene prominently put on their web page.

https://lucene.apache.org/core/3_5_0/scoring.html

"Lucene scoring is the heart of why we all love Lucene. It is blazingly fast and it hides almost all of the complexity from the user. In a nutshell, it works. At least, that is, until it doesn't work, or doesn't work as one would expect it to work. Then we are left digging into Lucene internals or asking for help on java-user@lucene.apache.org to figure out why a document with five of our query terms scores lower than a different document with only one of the query terms."

See? Lucene people know about it but they just do not think it is a problem.

But I do.

T-Wand, though it also uses a vector space model, makes sure this is the case.
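A toy illustration of the "no guarantee" point, with made-up collection statistics: under a plain IDF-weighted sum, a document matching one rare query term can outscore a document matching three common ones. The numbers and names below are hypothetical, and this sketches generic TF-IDF scoring, not T-Wand or Lucene's implementation.

```python
import math

N_DOCS = 1000  # hypothetical collection size

# Hypothetical document frequencies for each query term.
DOC_FREQ = {"the": 1000, "of": 990, "life": 400, "skywalker": 2}

def idf(term):
    # Smoothed IDF; common terms end up with a small weight.
    return math.log(1 + N_DOCS / DOC_FREQ[term])

def score(query, doc_terms):
    # Each matching query term contributes its IDF (term frequency of 1 assumed).
    return sum(idf(t) for t in query if t in doc_terms)

def matched(query, doc_terms):
    return sum(1 for t in query if t in doc_terms)

query = ["the", "life", "of", "skywalker"]
doc_a = {"the", "life", "of", "brian"}   # matches three common terms
doc_b = {"skywalker", "ranch"}           # matches one rare term

print(matched(query, doc_a), score(query, doc_a))
print(matched(query, doc_b), score(query, doc_b))
```

With these numbers doc_b ranks above doc_a despite matching fewer query terms. Whether that is a bug or a feature is exactly what the thread is arguing about.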

No, there’s no guarantee. A very strong bias, but no absolute guarantee. Most people use AND queries by default or set a high enough min should match.
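Roughly, a "min should match" setting acts as a filter in front of scoring: a document only becomes a candidate if it contains at least some number of the query terms, and an AND query is the special case where that number is all of them. A small sketch of the idea, with hypothetical helper names and toy documents (this is not a search engine API):

```python
def count_matches(query_terms, doc_tokens):
    # Number of distinct query terms present in the document.
    return sum(1 for term in set(query_terms) if term in doc_tokens)

def filter_candidates(query_terms, docs, min_should_match):
    # Keep only documents that contain enough of the query terms.
    return [
        doc_id
        for doc_id, tokens in docs.items()
        if count_matches(query_terms, tokens) >= min_should_match
    ]

docs = {
    "both": {"luke", "skywalker", "jedi"},
    "only_luke": {"luke", "cage"},
    "only_skywalker": {"skywalker", "ranch"},
    "neither": {"han", "solo"},
}
query = ["luke", "skywalker"]

and_hits = filter_candidates(query, docs, min_should_match=len(query))  # AND
or_hits = filter_candidates(query, docs, min_should_match=1)            # OR
print(and_hits, or_hits)
```

Scoring then only has to order the survivors, which is why a high threshold hides most "fewer terms ranked higher" cases in practice.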

Those settings aside, arguably there are cases where fewer term matches could be more relevant.

For the search “Luke skywalker”

A tweet, for example, mentioning the term “Skywalker” once has much higher “Star Wars” aboutness than a movie that uses “Skywalker” on one page and “Luke” pages apart.

That is, the information density of Star Wars-type content is far higher in the tweet, whereas a book eventually ends up using most English words somewhere.

You either have a guarantee or you don't; that's what the word "guarantee" means. Don't mince words.

You had a wrong understanding. Now you are corrected. Let's move on.

Nobody said anything about "fewer term matches could not be more relevant". You are just making up straw men here. That's not the discussion we are having.

What I said is this: from a user's point of view, it's not good for a document containing fewer query terms to rank higher. This is a fact that even Lucene acknowledges (at least as of version 3.5.0). You have nothing to counter this fact.

But you’re arguing that such a guarantee ALWAYS gives the most relevant result, when that’s not always the case.

There’s been extensive research justification behind the vector-space model. BM25 is the 25th iteration of a model, and well-tuned BM25 holds the highest non-neural performance on many tasks, including question answering[1]. Research has long found that including factors other than total term matches matters, such as IDF[2] and field length[3].
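For anyone following along, here's a sketch of one common form of the BM25 formula, showing how IDF and field length enter the score. The k1=1.2 and b=0.75 values are commonly used defaults; the collection statistics and document lengths are made up, echoing the tweet-versus-long-document example from earlier in the thread.

```python
import math

def bm25_term(tf, df, doc_len, n_docs, avg_len, k1=1.2, b=0.75):
    """Contribution of one query term to one document's score."""
    # IDF: terms found in fewer documents weigh more.
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    # Length normalization: longer-than-average documents are penalized.
    length_norm = k1 * (1 - b + b * doc_len / avg_len)
    # Term frequency saturates instead of growing linearly.
    return idf * tf * (k1 + 1) / (tf + length_norm)

# Hypothetical collection statistics.
N_DOCS, AVG_LEN = 1000, 1000
DF = {"luke": 50, "skywalker": 5}

def bm25(query, term_counts, doc_len):
    return sum(
        bm25_term(term_counts[t], DF[t], doc_len, N_DOCS, AVG_LEN)
        for t in query
        if term_counts.get(t, 0) > 0
    )

query = ["luke", "skywalker"]
tweet = bm25(query, {"skywalker": 1}, doc_len=10)
book = bm25(query, {"luke": 1, "skywalker": 1}, doc_len=100_000)
print(tweet, book)
```

With these numbers the ten-token tweet matching only 'skywalker' outscores the very long document that mentions both terms once each, because the length normalization dominates.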

Have you benchmarked your relevance assumptions similarly? If so I’d love to see them and learn more!

1 - https://www.elastic.co/blog/improving-search-relevance-with-...

2 - https://www.researchgate.net/publication/238123710_Understan...

3 - http://sifaka.cs.uiuc.edu/course/410s12/mir.pdf

You are making up a straw man again. Did I make that claim?

I was simply motivating my work, pointing out a problem that the current generation of search engines does not address.

What "relevance" is, of course, is a context-sensitive question.

You talk as if BM25 is the gold standard when it is not.

The research on this is all over the place. I just read an article that says that BM25 is way worse than alternative language models. You don't have to look far, for example this one that talks about Wand:

https://dl.acm.org/doi/10.1145/2537734.2537744

You know why I quit academia? It is useless arguments and virtue signaling like this.

I'd rather go out and build a damn thing that people like to use.

I merely pointed out that there's a problem, and I have a solution. I did not claim that my problem and my solution solve all problems. Isn't this obvious?

In my problem, my solution beats Lucene. It's as simple as that.

So if Lucene wants to be this infinitely configurable search library, it would be advisable to offer my solution as an option, or offer a better one that does something similar.

So far I have not seen any takers, only excuses.