Hacker News
by gibrown 3921 days ago
Lucene is moving away from TF-IDF to BM25 as the default. Pretty similar idea, but it tends to perform better with short content.

https://issues.apache.org/jira/browse/LUCENE-6789

https://en.wikipedia.org/wiki/Okapi_BM25

In the very limited test cases where I've compared them it hasn't mattered much, but others' results are pretty compelling.

https://www.elastic.co/blog/found-bm-vs-lucene-default-simil...
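To make the "pretty similar idea" concrete, here is a rough sketch comparing a textbook TF-IDF score with a BM25 score on a toy corpus. The corpus, the function names, and the plain `tf * log(N/df)` weighting are assumptions for illustration (this is not Lucene's actual scoring code); `K1 = 1.2` and `B = 0.75` are the commonly used BM25 parameter values.

```python
import math
from collections import Counter

# Toy corpus (hypothetical); each document is a list of tokens.
docs = [
    "lucene search".split(),
    "lucene lucene lucene lucene search engine library "
    "with many extra words in it".split(),
    "a short note about cooking".split(),
]

N = len(docs)
avgdl = sum(len(d) for d in docs) / N                 # average document length
df = Counter(term for d in docs for term in set(d))   # document frequency

K1, B = 1.2, 0.75  # commonly used BM25 parameters


def tfidf(term, doc):
    """Textbook TF-IDF: raw term frequency times log(N / df)."""
    tf = doc.count(term)
    if tf == 0:
        return 0.0
    return tf * math.log(N / df[term])


def bm25_idf(term):
    """Non-negative IDF variant often used with BM25."""
    return math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))


def bm25(term, doc):
    """BM25 for one query term: saturating tf plus length normalization."""
    tf = doc.count(term)
    if tf == 0:
        return 0.0
    length_norm = 1 - B + B * len(doc) / avgdl
    return bm25_idf(term) * tf * (K1 + 1) / (tf + K1 * length_norm)


for i, doc in enumerate(docs):
    print(i, round(tfidf("lucene", doc), 3), round(bm25("lucene", doc), 3))
```

In this sketch the raw TF-IDF score grows linearly with repetitions of the term, while the BM25 score levels off and is adjusted for document length, so the two-word document and the long, repetitive one end up scoring much closer together.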

1 comments

Vector Space replacing or being combined with TF-IDF approaches is a new way of summarizing and searching for meaning in documents...

http://52.11.1.7/TuataraSum/example_context_control-ml2.html

Interesting. This basically uses the background word2vec data for the entire Web to provide more information and help with things like disambiguation, synonyms, etc? Am I understanding that correctly?

Maybe a nit-picky thought, but it's not clear to me that the TF-IDF part is what's doing a lot of extra lifting there.

Do you know of any good evaluations between using vector space data and other methods for summarization?

Word2Vec was a fork of, or based on, a more exhaustive vector space approach; see here: https://www.kaggle.com/c/word2vec-nlp-tutorial/forums/t/1234...

I've compared the summarization to others like OTS http://libots.sourceforge.net/, which I believe relies strictly on TF-IDF. It seems better, and it allows context to control the summarization.

Other similar approaches might be based on Latent Semantic Analysis, Latent Semantic Indexing or LDA.
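As a rough sketch of what combining TF-IDF-style weights with word vectors might look like, the snippet below averages word vectors per sentence, weighted by IDF, and ranks sentences by cosine similarity to a context. Everything here is hypothetical: the 3-dimensional vectors, the IDF values, and the `embed`/`rank` helpers are made up for illustration and are not taken from the linked tool or from any real word2vec model.

```python
import math

# Hypothetical 3-dimensional "embeddings"; real word2vec vectors
# typically have hundreds of dimensions.
vectors = {
    "bank":  [0.90, 0.10, 0.00],
    "loan":  [0.80, 0.20, 0.00],
    "money": [0.85, 0.15, 0.05],
    "river": [0.10, 0.90, 0.00],
    "water": [0.00, 0.95, 0.10],
    "the":   [0.30, 0.30, 0.30],
}

# Hypothetical IDF weights: common words get little weight.
idf = {"bank": 1.5, "loan": 2.0, "money": 1.8,
       "river": 2.0, "water": 1.7, "the": 0.1}


def embed(tokens):
    """IDF-weighted average of the vectors of known tokens."""
    total = [0.0, 0.0, 0.0]
    weight_sum = 0.0
    for t in tokens:
        if t in vectors:  # tokens without a vector are skipped
            w = idf.get(t, 1.0)
            for i, x in enumerate(vectors[t]):
                total[i] += w * x
            weight_sum += w
    return [x / weight_sum for x in total] if weight_sum else total


def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rank(sentences, context_tokens):
    """Order sentences by similarity to the context, best first."""
    ctx = embed(context_tokens)
    return sorted(sentences, key=lambda s: cosine(embed(s), ctx), reverse=True)


sentences = [
    "the bank approved the loan".split(),
    "the river bank was under water".split(),
]

print(rank(sentences, ["money"])[0])
print(rank(sentences, ["water"])[0])
```

With these toy numbers, changing the context word changes which "bank" sentence ranks first, which is roughly the kind of disambiguation and context control discussed above.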

Thanks for the links!