by PaulHoule 990 days ago
I've used

https://sbert.net/

My take is you might like keyword searches better for some queries and you might like embedding search for others.

The problems of (1) how to combine keyword search and embedding search (you'd imagine you'd want a ranking function that handles both) and (2) how to handle chunks are both hard.

As for (2), you probably want to make the chunks as big as you practically can, and you should chunk on tokens instead of characters if at all possible.
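To make the token-vs-character point concrete, here is a minimal sketch of chunking on token boundaries. The regex `tokenize` is only a stand-in I made up so the example runs on its own; in practice you would swap in the tokenizer of whatever embedding model you use, and `max_tokens` and `overlap` are illustrative parameters, not recommended values.

```python
import re

def tokenize(text):
    # Stand-in tokenizer (an assumption for this sketch): words and punctuation.
    # A real setup would use the embedding model's own tokenizer instead.
    return re.findall(r"\w+|[^\w\s]", text)

def chunk_by_tokens(text, max_tokens=256, overlap=32):
    """Split text into chunks of at most max_tokens tokens.

    Consecutive chunks share `overlap` tokens so that a sentence cut at a
    chunk boundary still appears whole in at least one chunk.
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    tokens = tokenize(text)
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks
```

Counting tokens rather than characters means every chunk uses a predictable share of the model's input window, which is what lets you push the chunk size as close to the limit as is practical.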

With chunks, of course, you don't get a score for the query-document relationship; you get the query-chunk score instead, which isn't quite the score you really want. Aggregating all the chunk hits and properly chunking the data is an open problem, to say the least.
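For illustration only, here are two common heuristics for rolling query-chunk scores up into a per-document score: take the best chunk, or sum the top few chunks. This does not solve the aggregation problem described above; it just shows what the naive baselines look like. The function name, the `mode` values, and `top_k` are all hypothetical.

```python
from collections import defaultdict

def aggregate_chunk_scores(chunk_hits, mode="max", top_k=3):
    """Roll (doc_id, chunk_score) hits up into ranked (doc_id, doc_score) pairs.

    mode="max":      a document is as good as its best chunk.
    mode="topk_sum": sum the top_k chunk scores, which rewards documents
                     with several good chunks (and favors long documents).
    """
    by_doc = defaultdict(list)
    for doc_id, score in chunk_hits:
        by_doc[doc_id].append(score)

    doc_scores = {}
    for doc_id, scores in by_doc.items():
        if mode == "max":
            doc_scores[doc_id] = max(scores)
        elif mode == "topk_sum":
            doc_scores[doc_id] = sum(sorted(scores, reverse=True)[:top_k])
        else:
            raise ValueError(f"unknown mode: {mode}")

    # Highest-scoring documents first.
    return sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)
```

The two modes can rank the same hits differently, which is part of why this is hard: whether one excellent chunk beats several decent ones depends on the query and the collection.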

1 comments

Thank you! I've used the CrossEncoder from the sbert library you mentioned to rank the vector search and keyword search results, but it still doesn't help that the vector search results are odd.

This video suggests Elasticsearch is (or will be) able to blend the two: https://www.youtube.com/watch?v=5Qaxz2e2dVg

I'm hoping meilisearch adds that because ES is a beast of a software package.

There's combining the two and then there is combining the two and really getting better results.

Almost nobody in the commercial search field is doing quantitative evaluation despite

https://trec.nist.gov/

About a decade ago I worked on a neural-net powered search engine for patents that used a neural network to compress bag of words features into a 50-dimensional vector (think LDA on steroids) and used a patented algorithm to

https://patents.google.com/patent/CA2829569C

combine that result with the keyword vector. It kicked ass. When we put up a demo we got a call the first day from the USPTO wanting to buy it.

The thing is that algorithm searches over the feature vector and the residual of that feature vector in the bag-of-words space. There's no danger of infringing that with a BERT-like model because there isn't any such residual.

We tuned up w/ Gov2 data from TREC and that was essential for getting our parameters right.

If I were trying that now, I would use logistic regression to build a probability estimator for "this document is relevant" that takes multiple scores as input. I'm not sure how good the results would be, but it is a rational basis to get started. TREC specifically does not reward training a probability estimator because they have historically been interested in "long tail" results, and they don't think a kid sticking his hand up really high because he is really certain adds value. But there are a lot of things in IR that are really hard (alerting, for example) because we're mostly not using probability estimators, and the ones we do make aren't that great (e.g. I never see p>0.7 for a conventional search engine tuned up that way).
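A rough sketch of that logistic regression idea, in plain Python so it runs without any libraries. The two input features (a keyword score and an embedding score, both assumed to be scaled to [0, 1]) and the tiny training set are made up purely for illustration; real training data would come from relevance judgments such as the TREC ones.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def relevance_probability(weights, bias, scores):
    """Estimated P(document is relevant) given its per-ranker scores."""
    return sigmoid(bias + sum(w * s for w, s in zip(weights, scores)))

def fit_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = relevance_probability(weights, bias, x) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Toy, made-up judgments: (keyword_score, embedding_score) -> relevant?
X = [(0.9, 0.8), (0.8, 0.3), (0.2, 0.9), (0.7, 0.7),
     (0.1, 0.2), (0.3, 0.1), (0.2, 0.3), (0.4, 0.2)]
y = [1, 1, 1, 1, 0, 0, 0, 0]

weights, bias = fit_logistic(X, y)
print(relevance_probability(weights, bias, (0.9, 0.9)))
print(relevance_probability(weights, bias, (0.1, 0.1)))
```

The output is a probability rather than a bare ranking score, which is the property that would make something like alerting tractable: you can put a threshold on it. Whether the estimates are well calibrated on real data is a separate question that would need held-out judgments to check.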