Hacker News
by ekzhu 1670 days ago
I cannot continue reading after the following “declaration”… The author should take a look at the Wikipedia page for TF-IDF.

> As someone who has a Ph.D. in Human-computer Interaction ;-), I feel like I am entitled to define a condition of "good" in relevance here. I hereby declare that:

>> A good top-K algorithm should rank a document containing more user query terms higher than a document containing less number of user query terms.

> This makes perfect sense. Right?

Also, “most search engines” don’t use the vector space model as the only way to rank results; PageRank, for example, is another signal.

Edit: in some search scenarios finding the documents with the most query terms makes sense, but Lucene can also rank using this metric. Still, I applaud the author's effort in digging into the research literature. Search relevance is very hard, and standard off-the-shelf metrics like TF-IDF and PageRank are often not enough. Good search usually requires deep understanding of the specific subject domain and hand-tuning tons of signals, many of which aren't even strictly based on search terms (e.g., previously purchased products on a store's website, geographic location, trending results).


As author of Relevant Search and contributor to AI-Powered Search, I endorse this :)

Relevance is really subjective and domain-specific, and it requires an intense amount of measurement and testing and many different ranking signals. Lucene is a toolbox for crafting many of these signals.

On second read-thru, I think the author is maybe(?) describing assumptions behind WAND and relevance algos that benefit from it? Maybe not some overarching statement about relevance per-se? But it's mixed in with statements about relevance / what Lucene does that are mostly incorrect...

For example, he says

> However, as you can see, this vector space model does not explicitly require a higher ranking document to contain more query terms than a lower ranking one.

Well, the way you get a higher similarity in a vector-space model is by matching more terms. The caveat is that IDF and field length make you also consider a term's specificity. So if you search for 'luke skywalker', you care more about the 'skywalker' match than the 'luke' match. But a match on BOTH 'luke' and 'skywalker' would score higher (field lengths being constant).
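To make the 'luke skywalker' point concrete, here is a minimal sketch. It assumes a made-up five-document corpus, a plain sum of TF × IDF rather than Lucene's actual scoring formula, and equal field lengths (three words each), as in the comment above. The names `DOCS`, `idf`, and `score` are illustrative only.

```python
import math

# Hypothetical toy corpus: 'luke' is common, 'skywalker' is rare.
# Every document has three words, so field length is constant.
DOCS = {
    "d1": "luke went home".split(),
    "d2": "luke and leia".split(),
    "d3": "luke saw han".split(),
    "d4": "skywalker ranch tour".split(),
    "d5": "luke skywalker returns".split(),
}

def idf(term):
    """Smoothed inverse document frequency: rarer terms get larger weights."""
    df = sum(1 for words in DOCS.values() if term in words)
    return math.log(1 + len(DOCS) / df) if df else 0.0

def score(query, words):
    """Sum of term frequency times IDF over the query terms."""
    return sum(words.count(term) * idf(term) for term in query)

query = ["luke", "skywalker"]
scores = {name: score(query, words) for name, words in DOCS.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Under these assumptions the document matching both terms ranks first, and the 'skywalker'-only document ranks above the 'luke'-only ones because 'skywalker' is the rarer term.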

There's no guarantee in a pure vector space model that this is the case. Your understanding is way off. This is a caveat that Lucene prominently put on their web page.

https://lucene.apache.org/core/3_5_0/scoring.html

"Lucene scoring is the heart of why we all love Lucene. It is blazingly fast and it hides almost all of the complexity from the user. In a nutshell, it works. At least, that is, until it doesn't work, or doesn't work as one would expect it to work. Then we are left digging into Lucene internals or asking for help on java-user@lucene.apache.org to figure out why a document with five of our query terms scores lower than a different document with only one of the query terms."

See? Lucene people know about it but they just do not think it is a problem.

But I do.

T-Wand, though it also uses the vector space model, makes sure it is the case.

No there’s no guarantee. A very strong bias, but no absolute guarantee. Most people use AND queries by default or set a high enough min should match.
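As a rough illustration of the AND / "min should match" idea, the sketch below filters candidates by how many distinct query terms they contain before any scoring happens. It is plain Python, not Lucene's or Elasticsearch's API; `count_matches`, `filter_candidates`, and the sample documents are hypothetical.

```python
def count_matches(query_terms, doc_terms):
    """Number of distinct query terms that occur in the document."""
    return len(set(query_terms) & set(doc_terms))

def filter_candidates(query_terms, docs, min_should_match):
    """Keep only documents matching at least min_should_match query terms."""
    return [
        name
        for name, terms in docs.items()
        if count_matches(query_terms, terms) >= min_should_match
    ]

# Hypothetical documents for illustration.
docs = {
    "both": "luke skywalker returns".split(),
    "rare_only": "skywalker ranch tour".split(),
    "common_only": "luke went home".split(),
    "neither": "han shot first".split(),
}
query = ["luke", "skywalker"]

# A threshold of 1 behaves like an OR query; a threshold equal to the
# number of distinct query terms behaves like an AND query.
any_term = filter_candidates(query, docs, min_should_match=1)
all_terms = filter_candidates(query, docs, min_should_match=len(set(query)))
print(any_term, all_terms)
```

With the threshold set to the full query length, only documents containing every term survive, so a document with fewer query terms can never outrank one with more.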

Those settings aside, arguably there are cases where fewer term matches could be more relevant.

For the search “Luke skywalker”

A tweet, for example, mentioning the term “Skywalker” once has much higher “Star Wars” aboutness than a novel that uses “Skywalker” on one page and “Luke” pages apart.

That is, the information density of Star Wars-type content is far higher in the tweet, whereas a book eventually ends up using most English words somewhere.

You either have a guarantee or you don't; that's what the word "guarantee" means. Don't mince words.

You had a wrong understanding. Now you are corrected. Let's move on.

Nobody said anything about "fewer term matches could not be more relevant". You are just making up straw men here. That's not the discussion we are having.

What I said is this: from a user's point of view, it's not good for a document containing fewer query terms to rank higher. This is a fact that even Lucene acknowledges (at least as of version 3.5.0). You have nothing to counter this fact.

But you’re arguing that such a guarantee ALWAYS gives the most relevant result, when that’s not always the case.

There’s been extensive research justification behind the vector-space model. BM25 is the 25th iteration of a model, and well-tuned BM25 holds the highest non-neural performance on many tasks, including question answering[1]. Research has long found that including factors other than total term matches matters, such as IDF[2] and field length[3].
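For readers who want to see how IDF and field length enter the score, here is a small sketch of the standard Okapi BM25 formula. The corpus is invented, `k1 = 1.2` and `b = 0.75` are commonly used defaults rather than tuned values, and the function names are illustrative; this is not a benchmark and not any engine's exact implementation.

```python
import math

K1 = 1.2   # term-frequency saturation (common default, assumed here)
B = 0.75   # strength of length normalization (common default, assumed here)

# Hypothetical corpus: a short "tweet" and a long document padded with filler.
DOCS = {
    "tweet": "skywalker forever".split(),
    "long": ("luke skywalker " + "filler " * 38).split(),
    "a": "luke eats lunch".split(),
    "b": "luke drives home".split(),
}
AVG_LEN = sum(len(words) for words in DOCS.values()) / len(DOCS)

def idf(term):
    """BM25 inverse document frequency with the usual +0.5 smoothing."""
    n = sum(1 for words in DOCS.values() if term in words)
    return math.log(1 + (len(DOCS) - n + 0.5) / (n + 0.5))

def bm25(query, words):
    """Okapi BM25 score of one document for a list of query terms."""
    length_norm = K1 * (1 - B + B * len(words) / AVG_LEN)
    total = 0.0
    for term in query:
        tf = words.count(term)
        total += idf(term) * tf * (K1 + 1) / (tf + length_norm)
    return total

query = ["luke", "skywalker"]
scores = {name: bm25(query, words) for name, words in DOCS.items()}
print(sorted(scores.items(), key=lambda item: item[1], reverse=True))
```

In this toy setup the short document matching only 'skywalker' scores above the long document matching both terms, which is the kind of outcome both sides of the thread are arguing about.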

Have you benchmarked your relevance assumptions similarly? If so I’d love to see them and learn more!

1 - https://www.elastic.co/blog/improving-search-relevance-with-...

2 - https://www.researchgate.net/publication/238123710_Understan...

3 - http://sifaka.cs.uiuc.edu/course/410s12/mir.pdf

I also suggest you look at his "benchmarks" code.
It's your loss then.

It is actually smug to suggest that the author, in this case me, a computer scientist and a former professor who taught an Information Retrieval class for more than 3 years, and who just came up with a new search algorithm, should "take a look at the Wikipedia page for TF-IDF".

In case you have not read the article due to your smugness, my search algorithm also uses TF-IDF and the vector space model.

Please do not misunderstand what is going on here: I am running a startup, and I am also old enough to not care about publications as much as people who are younger or in academia.

That's why I chose to reveal this in a blog post instead of hiding it until after my paper is published. Understood?

Haters gonna hate, don't mind them. Your work is truly amazing, I learned a lot and you opened a whole new world of datalog as an aside.
Thank you. I am glad that you learned something from the article. I wrote the article for people like you, who are seeking knowledge and self improvement.
Yeah, beyond the cringe of thinking a Ph.D. really means anything, it just highlights the pure lack of Lucene knowledge.
A Ph.D. means a lot to many people. It's years of work to make science progress a tiny little bit on a topic you enjoy (supposedly). I think it's a lot harder and more meaningful than writing scalable and reliable code as a team in a software company, just to give one example to compare with.
But a PhD in HCI has very little relevance to information retrieval or computational linguistics.
Exactly this. I have a PhD in CS and am a world-class expert in multi-omic data integration and analysis. I'm happy to throw my weight around in that area, but I'd never point to my PhD to pontificate on neural nets or systems or queuing theory or 99 percent of CS. If getting a PhD doesn't teach you how much you don't know and how hard it is to develop real expertise in any area, I think you wasted your PhD. Note, my PhD isn't on the value of PhDs, so take it as you will.
I think I am pretty qualified to make my declaration, since it is about user experience.

I also do know a lot about IR, because I taught an Information Retrieval class for 3 years when I was teaching at a university, I read research papers, and I just came up with a new search algorithm.

It's just that some people cannot accept that there are people who can cross fields with ease, make contributions quickly, and move on to the next field that piques their interest.

Yes, I am one of those people. In addition to HCI, I also published in the following areas: VR, DB, NLP, IR and Psychology. Sorry to hurt your feelings, but it is what it is. Accept it and move on.

That's not a new search algorithm. It's about the second one people have come up with. First version: all documents containing all search terms. Second version: counting frequency in document. Since you taught IR for three years, I'm surprised there's no mention of other relevant heuristics, or measuring against benchmarks.
Of course it doesn't, and the author doesn't try to make that claim. The author jokes that their HCI PhD is only good enough to give them the authority to make obvious statements about what a "good" search experience should be for a user.

What did I miss?

It does have relevance to user experience, which is under discussion.
Well yeah, but it shouldn't make you smug. A Ph.D., while not the beginning, is definitely not the end, and the people we look up to as Really Smart never rested on their laurels and quit learning, never considered themselves above others.
It's fascinating seeing what hacker news thinks is smug. There are plenty of actually smug comments that go without being called out but for some reason this guy does.

The PhD flex in jest by the blog post's author is a bit awkward but I don't know that I would characterize it as smug.

Any insight into why the author is perceived as smug would be appreciated.

Let me offer my arguably biased insight: racism.

I am obviously Chinese.

In the eyes of racists, Chinese Americans are supposed to be timid and not make any noise, but I do make noise, so I am perceived as smug today because I said I have a Ph.D., and as something else next time when I say something else.

It's not complicated.

Wow, I was with you until this. I, for one, never even suspected your origin or nationality reading the blog post.

OK, re-reading it, there are a few hints (a Chinese proverb, your nickname, etc.). But nothing "obvious".

Probably not the explanation we are looking for here.

Agreed. The wink right after it, I would have thought, dulled any smugness.

I could see calling it audacious. But, our industry advances on audacity.

Well it’s certainly more meaningful to the person who has one.