by kirillkh
3168 days ago
Asking as someone with barely any clue in this field: is there a way to use this for full-text search, e.g. with Lucene? I know from experience that for some languages (e.g. Hebrew) there are no good stemmers available out of the box, so can you easily build a stemmer/lemmatizer (or even something more powerful [1]) on top of word2vec or fastText?
[1] E.g., for each word in a document or a search string, it would generate not just the word's base form, but also the top 3 other base forms that are similar in meaning to it (where the meaning is inferred from context).
In general, if you want to search over millions of documents, use Annoy from Spotify. It builds an approximate nearest-neighbor index over millions of vectors (document vectors for this application) and finds similar documents in roughly logarithmic time, so you can search large collections by fuzzy meaning rather than by exact keywords.
https://github.com/spotify/annoy
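
To make "search by fuzzy meaning" concrete, here is a rough sketch of the exact, brute-force version of the lookup that Annoy approximates: rank documents by cosine similarity to a query vector. The document ids and 3-dimensional vectors are made up for illustration; real vectors would come from averaging word2vec/fastText embeddings or a similar method, and the helper names `cosine` and `most_similar` are mine, not part of any library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def most_similar(query, doc_vectors, k=3):
    """Return the ids of the k documents closest to `query` (linear scan)."""
    ranked = sorted(
        doc_vectors.items(),
        key=lambda item: cosine(query, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]

# Hypothetical toy "document vectors" (assumption: not real embeddings).
docs = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.8, 0.3, 0.1],
    "doc_stocks": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]

print(most_similar(query, docs, k=2))
```

This scan is linear in the number of documents, which is the cost Annoy avoids. With Annoy you would instead create an `AnnoyIndex(f, 'angular')`, call `add_item(i, vector)` for each document, `build(n_trees)` once, and then query with `get_nns_by_vector(query, k)`, trading a little accuracy for speed.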