by kirillkh
3160 days ago
Some clarifying questions:
1) Do words in a generic corpus (such as Wikipedia) actually form well-separated clusters?
2) Is it correct that you find word clusters in the corpus as a preprocessing step (as opposed to at indexing or query time)?
3) Do I understand correctly that you use all words in clusters as synonyms and pass them to Solr at query/indexing time? Is it query time, index time or both?
4) Given a language where words have many syntactic forms (e.g. buy-bought-buying), how does it work with clusters? Do both syntactic forms and synonyms end up in the same cluster? Wouldn't it be beneficial to treat many of these different forms as the same word (i.e. perform stemming) and only list truly different, but closely related concepts as synonyms?

2. Correct. The goal of the pre-processing step is to generate a Solr synonyms file which can be added to your index mapping.
3a. You could use all the words, but in general I would advise against it. Using every word from Wikipedia or Google News would be like using a thesaurus, which can add a lot of noise. For example, the word "cocoa" could mean chocolate, a city in Florida, or a programming framework. It is better to use a list of domain-specific keywords and phrases as a filter for which words are added to the Solr synonyms file. However, if your corpus is Wikipedia, Google News, or something equally generic, then using all the words makes sense.
3b. It must be both query time and index time. For example, the phrase "java developer" would have the mapping "java developer => cluster_15" in the synonyms file. For the search terms "java developer" to match a document, "cluster_15" must be indexed in place of "java developer", and the query must be rewritten to "cluster_15" in the same way.
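4. The different forms will most likely end up in the same cluster anyway, but stemming the text first would guarantee it for forms that share a stem (an irregular pair such as buy/bought would need lemmatization rather than plain stemming).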
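
To make points 2 to 3b concrete, here is a rough Python sketch of the idea, not the actual pipeline: it filters cluster assignments by a domain keyword list, emits "term => cluster_N" lines like the one quoted above, and then applies the same mapping to a document and to a query. The names `word_to_cluster`, `domain_terms`, `build_synonym_rules`, `parse_rules` and `analyze` are all made up for illustration, the cluster ids are invented, and `analyze` is only a toy stand-in for whatever synonym filter Solr runs in its index and query analyzers.

```python
def build_synonym_rules(word_to_cluster, domain_terms):
    """Emit one 'term => cluster_N' line per domain term that has a cluster.

    word_to_cluster: hypothetical output of the clustering step (term -> id).
    domain_terms: the domain-specific keyword/phrase filter from point 3a.
    """
    rules = []
    for term in sorted(domain_terms):
        cluster_id = word_to_cluster.get(term)
        if cluster_id is not None:
            rules.append(f"{term} => cluster_{cluster_id}")
    return rules


def parse_rules(rules):
    """Turn 'lhs => rhs' lines back into a phrase -> cluster-token dict."""
    mapping = {}
    for line in rules:
        lhs, rhs = line.split("=>")
        mapping[lhs.strip()] = rhs.strip()
    return mapping


def analyze(text, mapping):
    """Toy analyzer: lowercase, split on whitespace, and replace the longest
    known phrase at each position with its cluster token."""
    tokens = text.lower().split()
    max_len = max((len(phrase.split()) for phrase in mapping), default=1)
    out = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in mapping:
                out.append(mapping[phrase])
                i += n
                break
        else:
            # No phrase starting here is in the mapping; keep the raw token.
            out.append(tokens[i])
            i += 1
    return out


# Invented cluster assignments; "cocoa" is clustered but filtered out below.
word_to_cluster = {
    "java developer": 15,
    "java engineer": 15,
    "cocoa": 7,
    "chocolate": 7,
}
domain_terms = {"java developer", "java engineer"}

rules = build_synonym_rules(word_to_cluster, domain_terms)
mapping = parse_rules(rules)

indexed_tokens = analyze("Senior Java Engineer wanted", mapping)  # index time
query_tokens = analyze("java developer", mapping)                 # query time
matches = set(query_tokens) <= set(indexed_tokens)
```

The document says "java engineer" and the query says "java developer", yet both are reduced to the same cluster token, which is why the mapping has to run on both sides. If it ran only at index time, the query would still contain the raw words "java" and "developer" and would not hit the indexed "cluster_15".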