Hacker News new | ask | show | jobs
by kirillkh 3153 days ago
How can this be used for full-text search, e.g. with Lucene? The first step in indexing a document for full-text search is reducing each word to its base form, and similarly for a search string. While it's not a difficult problem in English, in some languages (e.g. Herew) it's notoriously hard to figure out the base form of a word and further disambiguate its meaning, as the only way to do so is based on context. So how can you easily build a stemmer/lemmatizer on top of these instruments to perform such task?
3 comments

This presentation from 2015 [1] answers your question. The basic idea is to create a list of keywords and/or phrases for your corpus. It can either be done manually or automatically using gensim or a similar tool. Then use word2vec to create vectors for the keywords and phrases. Cluster the vectors and use the clusters as "synonyms" at both index and query time using a Solr synonyms file.

You can also use Brown clustering [3] to create the clusters. It does a good job and is faster to compute than clustered word vectors. However clustered word vectors typically have better semantic performance.

1. https://www.slideshare.net/mobile/lucidworks/implementing-co...

2. Demo source: https://github.com/DiceTechJobs/ConceptualSearch

3. https://github.com/percyliang/brown-cluster

Some clarifying questions:

1) Do words in a generic corpus (such as Wikipedia) actually form well-separated clusters?

2) Is it correct that you find word clusters in the corpus as a preprocessing step (as opposed to at indexing or query time)?

3) Do I understand correctly that you use all words in clusters as synonyms and pass them to Solr at query/indexing time? Is it query time, index time or both?

4) Given a language where words have many syntactic forms (e.g. buy-bought-buying), how does it work with clusters? Do both syntactic forms and synonyms end up in the same cluster? Wouldn't it be beneficial to treat many of these different forms as the same word (i.e. perform stemming) and only list truly different, but closely related concepts as synonyms?

1. It should. The talk recommends using multiple cluster sizes (e.g. 50,500,5000) and give more weight in the query to smaller clusters. Ideally you would run word2vec on your own domain-specific corpus and then cluster, but that only works if your corpus is of sufficient size.

2. Correct. The goal of the pre-processing step is to generate a Solr synonyms file which can be added to your index mapping.

3a. You could use all the words, but in general I would advise against it. Using all the words from Wikipedia or Google News would be similar to using a thesaurus which can add a lot of noise. For example, the word "cocoa" could mean chocolate, a city in Florida, or programming language. It is better to use a list of domain-specific keywords and phrases as a filter for which words are added to the Solr synonyms file. However if your corpus is Wikipedia, Google News, or something equally generic, then using all the words makes sense.

3b. It must be both query and index time. For example, the phrase "java developer" would have the mapping "java developer => cluster_15" in the synonyms file. In order for the search terms "java developer" to match cluster_15, "cluster_15" must be indexed in place of "java developer".

4. The different forms will most likely end up in the same cluster, but stemming would guarantee it.

4) But recall that the language in question is such that stemming is hard. If you expand every form of every word in Hebrew, you obtain something like 600,000 words. And many of them have completely different meanings due to syntactic coincidences and short roots. So, ideally, the first step would a) determine what exactly each word in this document means in the given context, b) replace it with an unambiguous identifier.

For example, in Hebrew the word BRHA can mean several things: "pool", "blessing", "in soft" and "her knee" (no kidding).

I don't know anything about Hebrew. But maybe lemmatization would be better than stemming since it takes meaning and context into account. It is also possible that it is an unnecessary step for clustered word vectors for your use case. If it was me, I would try without stemming/lemmatization first.

EDIT

I found this Hebrew analyzer for Lucene/Solr/Elasticsearch [1] which appears to do stemming or lemmatization. Potentially you could use the output of the analyzer as the input to word2vec.

1. https://github.com/synhershko/HebMorph

Please forgive me attempting to milk as much as possible from this discussion - I just don't have many opportunities to get useful advice on this subject, and I've been mulling over it for a long time.

> But maybe lemmatization would be better than stemming

You're right, I'm using "stemming" and "lemmatization" interchangeably where I shouldn't. What I mean is lemmatization.

> It is also possible that it is an unnecessary step for clustered word vectors for your use case

I don't focus on a specific use case, I'm just trying to find a way to enable full-text search for Hebrew. Searching based on concept similarity is a very cool addition, though, and I do have some use cases in mind for it specifically. But I'm just thinking what a typical cluster would look like, and I imagine 99.9% of it will be different forms of the same handful of base forms. Furthermore, telling Lucene to match based on all these forms will inevitably create a large number of false positives due to the aforementioned abundance of homonyms. So I can see a clear problem here even now. That's why I keep reiterating my original question of whether this system can first be used for lemmatizing and then everything else.

Thank you! This sounds exactly what I was imagining. Very exciting!
For integrating into Solr, I've used Word2vec to improve rankings the synonyms it finds (boosting synonyms by how similar they are to the query). In English Word2vec tends to think plurals are synonyms.

I did a talk on this with more details: https://www.slideshare.net/GarySieling/ai-with-the-best-buil...

Sense2Vec might also help: https://github.com/explosion/sense2vec

There is also a project called Vespa, which looks potentially interesting as a replacement for Lucene - https://github.com/vespa-engine/vespa

The dice talk mentions that also (weighting by word2vec similarity). It's important to note that word2vec, LSA, sense2vec, etc, all find words that are RELATED but not necessarily SYNONYMOUS. For instance, antonyms like black and white, rich and poor, often appear in the same contexts, have the same word type but are opposite in meaning. Similarly, politicians on the opposite ends of the political spectrum will usually get assigned similar vectors and the same cluster as they tend to appear in similar contexts. It uses the context (word window in word2vec and GloVe, the document in LSA) to determine a measure of similarity. But the context for antonyms is typically very similar. Attaching the part-of-speech tag to each word before pushing it through these models can help as it enforces that grammatical relation, but this won't address all of these issues (e.g. black and white are both adjectives). In my experience, if people mostly search for nouns in your search engine (e.g. job search) this issue is also less of a concern, but can still cause cause problems. Finally, conceptual search can also help with precision - by matching across all concepts within a document, you can help disambiguate its meaning, when you have words that have multiple meanings.
> Finally, conceptual search can also help with precision - by matching across all concepts within a document, you can help disambiguate its meaning, when you have words that have multiple meanings

Thanks! I've written up something along these lines here: https://news.ycombinator.com/item?id=15592196

I'd love to hear your opinion if this is going to work.

Run Doc2Vec (or Word2Vec) on a large corpus of text or download pretrained vectors. To compute a document vector, take a linear combination of the word vectors in the document according to TFIDF. Now that you have vectors for each document, you need to create a fast index with a library called "Annoy". It can do very fast similarity search in vector space for millions of documents. I think this approach works faster than grep and doesn't need to bother with stemming. It will automatically know that "machine learning" and "neural nets" are related, so it does a kind of fuzzy search.
If you wanted it to know that "machine learning" and "neural networks" were related, wouldn't you need to do some type of entity extraction first, since Word2vec is run on tokens?
You can use Gensim:

    from gensim.models.phrases import Phrases
    bigrams = Phrases(corpus)
or you could rank bigrams by count(w1+w2)^2/(count(w1)*count(w2))

many variations on this formula work, but the idea is to compare the count of the bigram to the counts of the unigrams.

By the way, you do bigram identification before Word2Vec to have specialized vectors for bigrams as well.

Besides this method, there is one great way to identify ngrams: use Wikipedia titles. It's quite an extended list that covers most of the important named entities, locations and multi-word topic names, or go directly to http://wiki.dbpedia.org/ for a huge list with millions of ngrams. Cross reference it with your text corpus and you get a nice clean list.

The original word2vec source code comes with a probabilistic phrase detection tool. Keyword: word2phrase.
Good to know, thanks!
Alternatively to tf-idf, there's an interesting property in word embeddings generated by word2vec : they're sorted by rarity (the most common words being on top of the list).

So if you insert them in the same order in a database, you can just use their primary key as weight for a word. This also has the advantage of filtering out stop words without any additional processing.

If I understand correctly, this forgoes Lucene entirely. I would really like something that can be integrated into Lucene/Solr due to the availability of all the infrastructure build around it.

> works faster than grep

I didn't quite get the connection to grep.

Suppose you have gigabytes of text, Annoy will find matching articles faster and more precise than grepping with keywords.
Lucene is faster and better than grep too. Annoy may be better than Lucene's "more like this" query which is for finding similar documents in an index to a given set of documents. But how would it be helpful for keyword search which is what is being asked about?
I know, inverted index search is fast, it is the basic search engine algorithm, but there is a difference in quality of top ranked results. With word vectors you can ensure the topic of the whole document is what you want. Many documents mix topics and some keywords appear by mistake in the wrong place, for example, because scraping web text is imperfect and might capture extra text.