by kirillkh 3168 days ago
Asking as someone who barely has a clue in this field: is there a way to use this for full-text search, e.g. with Lucene? I know from experience that for some languages (e.g. Hebrew) there are no good stemmers available out of the box, so can you easily build a stemmer/lemmatizer (or even something more powerful? [1]) on top of word2vec or fastText?

[1] E.g., for each word in a document or a search string, it would generate not just its base form, but also a list of top 3 base forms that are different, but similar in meaning to this word's base form (where the meaning is inferred based on context).

1 comment

You can do all that and more. For example, to find lexical variations of a word, compute word vectors for the corpus, then look up the vectors most similar to the root word's vector and keep only the words that also start with the same first 3 or 4 letters as the root. It's almost perfect at finding not only valid variations, but also misspellings.
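
Roughly, that lookup could be sketched like this in plain Python. The vectors below are made-up toy values, not real word2vec or fastText output, and `lexical_variants`, `prefix_len` and `top_n` are names I picked for illustration. In practice you would load vectors trained on your own corpus.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def lexical_variants(root, vectors, prefix_len=3, top_n=5):
    """Return up to top_n words that share the root's first prefix_len
    letters, ordered by cosine similarity to the root's vector."""
    prefix = root[:prefix_len]
    root_vec = vectors[root]
    scored = [
        (cosine(root_vec, vec), word)
        for word, vec in vectors.items()
        if word != root and word.startswith(prefix)
    ]
    scored.sort(reverse=True)  # most similar first
    return [word for _, word in scored[:top_n]]


# Toy vectors (assumption: invented for this example only).
toy_vectors = {
    "search":    [0.90, 0.10, 0.00],
    "searching": [0.85, 0.15, 0.05],
    "searched":  [0.80, 0.20, 0.00],
    "searhc":    [0.88, 0.10, 0.10],  # misspelling, same prefix
    "seaside":   [0.00, 0.20, 0.90],  # same prefix, unrelated meaning
    "lookup":    [0.90, 0.10, 0.05],  # similar meaning, different prefix
}

print(lexical_variants("search", toy_vectors, prefix_len=3, top_n=3))
```

The prefix check is what drops "lookup", and the similarity ranking is what pushes "seaside" to the bottom. With real vectors you would probably also want a minimum-similarity cutoff instead of relying on `top_n` alone.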

In general, if you want to search over millions of documents, use Annoy from Spotify. It can index millions of vectors (document vectors, for this application) and find approximate nearest neighbors in roughly logarithmic time, so you can search large collections by fuzzy meaning.

https://github.com/spotify/annoy
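
A minimal sketch of the document-search idea, assuming you already have one fixed-length vector per document. The document vectors here are invented toy values, and `build_index` and `similar_documents` are hypothetical helper names. The Annoy calls used (`AnnoyIndex`, `set_seed`, `add_item`, `build`, `get_nns_by_vector`) are from its Python bindings. The exact-search fallback is only there so the snippet still runs when Annoy isn't installed.

```python
import math

try:
    from annoy import AnnoyIndex  # pip install annoy
except ImportError:
    AnnoyIndex = None  # use the exact fallback below instead


def build_index(doc_vectors, n_trees=10):
    """Build an Annoy index over the document vectors (angular distance)."""
    dim = len(doc_vectors[0])
    index = AnnoyIndex(dim, "angular")
    index.set_seed(42)  # fixed seed so repeated builds give the same trees
    for i, vec in enumerate(doc_vectors):
        index.add_item(i, vec)
    index.build(n_trees)  # more trees: better recall, bigger index
    return index


def similar_documents(doc_vectors, query_vec, k=3):
    """Return the ids of the k documents closest to query_vec."""
    if AnnoyIndex is not None:
        # For real use, build the index once and reuse it across queries.
        return build_index(doc_vectors).get_nns_by_vector(query_vec, k)

    # Exact fallback: rank every document by cosine similarity.
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    ranked = sorted(range(len(doc_vectors)),
                    key=lambda i: cosine(doc_vectors[i], query_vec),
                    reverse=True)
    return ranked[:k]


# Toy document vectors (assumption: invented for this example only).
docs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

print(similar_documents(docs, [1.0, 0.0, 0.0], k=2))
```

Annoy returns approximate neighbors, so on a large index you trade a bit of recall for speed. The number of trees (and the `search_k` argument at query time) is the knob for that.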