by rpedela
3158 days ago
1. It should. The talk recommends using multiple cluster sizes (e.g. 50, 500, 5000) and giving more weight in the query to the smaller clusters. Ideally you would run word2vec on your own domain-specific corpus and then cluster, but that only works if your corpus is large enough.

2. Correct. The goal of the pre-processing step is to generate a Solr synonyms file which can be added to your index mapping.

3a. You could use all the words, but in general I would advise against it. Using all the words from Wikipedia or Google News would be similar to using a thesaurus, which can add a lot of noise. For example, the word "cocoa" could mean chocolate, a city in Florida, or Apple's programming framework. It is better to use a list of domain-specific keywords and phrases as a filter for which words are added to the Solr synonyms file. However, if your corpus is Wikipedia, Google News, or something equally generic, then using all the words makes sense.

3b. It must be applied at both query and index time. For example, the phrase "java developer" would have the mapping "java developer => cluster_15" in the synonyms file. In order for the search terms "java developer" to match cluster_15, "cluster_15" must be indexed in place of "java developer".

4. The different forms will most likely end up in the same cluster, but stemming would guarantee it.
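To make the pre-processing step from 2 and 3a concrete, here is a minimal sketch of how the synonyms file could be generated. It assumes the clustering has already been done and is available as a plain dict from term to cluster id; the names `build_synonym_lines`, `term_to_cluster`, and `domain_terms`, and the sample data, are made up for illustration and are not from the talk.

```python
def build_synonym_lines(term_to_cluster, domain_terms):
    """Return Solr synonym mappings for terms that pass the domain filter."""
    # Only terms on the domain-specific list are written out; everything
    # else (e.g. the ambiguous "cocoa") is dropped to avoid adding noise.
    allowed = {t.lower() for t in domain_terms}
    lines = []
    for term, cluster_id in sorted(term_to_cluster.items()):
        if term.lower() in allowed:
            # Explicit mapping: the term is replaced by its cluster token.
            lines.append(f"{term.lower()} => cluster_{cluster_id}")
    return lines


# Hypothetical result of clustering word2vec vectors: term -> cluster id.
term_to_cluster = {
    "java developer": 15,
    "software engineer": 15,
    "cocoa": 42,
}
# Hypothetical domain keyword list used as the filter.
domain_terms = ["java developer", "software engineer"]

synonym_lines = build_synonym_lines(term_to_cluster, domain_terms)
print("\n".join(synonym_lines))
```

Each printed line is one entry for the synonyms file. To follow the multiple-cluster-size recommendation from 1, you would presumably run this once per clustering and keep the outputs distinguishable, for example with a different token prefix per size.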
|
For example, in Hebrew the word BRHA can mean several things: "pool", "blessing", "in soft" and "her knee" (no kidding).