by bryanrasmussen 1040 days ago
I guess if you wanted decompounding and stemming, you would have to create the fields with the stemmed and decompounded values yourself, and then apply the same processing to the queries as well? Or is there a built-in way to do that kind of thing somewhere in there?
2 comments

I found that stemming the text before generating vectors helps increase recall, and the vectors still capture context, etc. However, it does hurt precision because some information is lost by stemming. The more recent vector training algorithms are better able to capture semantic, syntactic, and contextual similarity without a lot of preprocessing. So I have found that vectors can replace all the nonsense that used to be needed to increase recall: stemming, manual synonym lists, etc.

However, vector similarity search only helps with the text matching itself, not with ranking. TF-IDF, BM25, PageRank, learning-to-rank ML, etc. are still needed to rank documents. Whenever I find a new vector search engine, I always look to see what ranking features it has beyond vector similarity.
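To make the "ranking beyond vector similarity" point concrete, here is a minimal sketch that re-ranks a handful of vector-search candidates with a BM25 score. The blending weight `alpha`, the `rerank` helper, and the vector scores in the usage note are all hypothetical choices for illustration, not how any particular engine does it.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc against the query with a plain BM25 formula.

    Assumes docs is non-empty and whitespace tokenization is good enough.
    """
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    # Document frequency: in how many docs does each term appear?
    df = Counter()
    for d in tokenized:
        df.update(set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[term] * (k1 + 1) / norm
        scores.append(s)
    return scores

def rerank(query, candidates, vector_scores, alpha=0.5):
    """Blend vector similarity with normalized BM25 (alpha is an assumed weight)."""
    lexical = bm25_scores(query, candidates)
    top = max(lexical) or 1.0  # avoid dividing by zero when nothing matches
    blended = [alpha * v + (1 - alpha) * (l / top)
               for v, l in zip(vector_scores, lexical)]
    return sorted(zip(candidates, blended), key=lambda p: p[1], reverse=True)
```

For example, `rerank("ranking documents", ["vector search engine", "ranking documents with bm25", "cooking recipes"], [0.9, 0.8, 0.85])` puts the second candidate first even though its (made-up) vector score is the lowest, because it is the only one with lexical matches.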

I would want to do something similar to Lucene's support for indexing stemmed and non-stemmed fields together, so that you could rank a hit in the non-stemmed field higher than a hit in the stemmed field, which helps precision.

In my experience this is more useful in complicated document searches.
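A rough sketch of the stemmed plus non-stemmed idea, to show what I mean. This is not Lucene code: `naive_stem` is a toy suffix stripper standing in for a real stemmer, and the field names and boost values are assumptions picked for illustration.

```python
from collections import Counter

def naive_stem(token):
    """Toy suffix stripper; a stand-in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def tokenize(text):
    return text.lower().split()

def index_doc(text):
    """Keep two 'fields' per document: exact tokens and stemmed tokens."""
    tokens = tokenize(text)
    return {
        "exact": Counter(tokens),
        "stemmed": Counter(naive_stem(t) for t in tokens),
    }

def score(query, doc, exact_boost=2.0, stemmed_boost=1.0):
    """An exact-field hit earns more than a hit that only matches after stemming."""
    total = 0.0
    for tok in tokenize(query):
        if doc["exact"][tok] > 0:
            total += exact_boost
        elif doc["stemmed"][naive_stem(tok)] > 0:
            total += stemmed_boost
    return total
```

With these boosts, the query "searching" scores 2.0 against "searching documents" but only 1.0 against "searched documents", so the stemmed field still contributes recall while the exact match wins on ranking.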

At the moment you would need to do this yourself. It would be possible to add extra preprocessing to accommodate this, though. Feel free to add a feature request here: https://github.com/marqo-ai/marqo/issues. The other consideration is that you would want the distribution of the content and queries to match what the selected model was trained on.
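For the do-it-yourself route, a minimal sketch of what that could look like: run the same preprocessing over documents at index time and over queries at search time. The `decompound` and `preprocess` helpers, the tiny vocabulary, and the `text_decompounded` field name are all hypothetical; nothing here is a Marqo API call.

```python
def decompound(token, vocabulary, min_part=3):
    """Split a token into two known words if possible, else return it unchanged."""
    for i in range(min_part, len(token) - min_part + 1):
        head, tail = token[:i], token[i:]
        if head in vocabulary and tail in vocabulary:
            return [head, tail]
    return [token]

def preprocess(text, vocabulary):
    """Lowercase, split on whitespace, and decompound each token."""
    parts = []
    for token in text.lower().split():
        parts.extend(decompound(token, vocabulary))
    return " ".join(parts)

def build_document(text, vocabulary):
    """Keep the original text next to the preprocessed copy (field names assumed)."""
    return {"text": text, "text_decompounded": preprocess(text, vocabulary)}

# Hypothetical vocabulary; a real one would come from a dictionary or corpus.
VOCAB = {"foot", "ball", "base"}

doc = build_document("Football match", VOCAB)
query = preprocess("football", VOCAB)  # same function at query time
```

Keeping the original text alongside the processed field means the untouched version is still there for the model to embed, which matters given the point about matching the distribution the model was trained on.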