Hacker News
by itronitron 2905 days ago
The Lucene API has a lot of language-specific tokenizers and analyzers that normalize terms before they go into the index, so a term looks the same in the index regardless of the source language. You can then apply various statistical NLP methods on top of that, which tend to be more language-agnostic.
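To make the second half concrete, here is a rough sketch of one statistical method (TF-IDF) that only ever sees normalized tokens and so doesn't care what language they came from. It does not call Lucene at all: the token lists are made-up stand-ins for what a language-specific analyzer might emit, and `tf_idf` is a hypothetical helper name, not anything from the Lucene API.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Score terms per document.

    docs: list of token lists, assumed to be already normalized
    (lowercased, stemmed, etc.) by a language-specific analyzer.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            # Relative term frequency times inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores

# Made-up analyzer output (assumption): one English-like document,
# one German-like document, and one short extra document.
docs = [
    ["search", "engin", "index", "term"],
    ["such", "maschin", "index", "term"],
    ["index", "cat"],
]
scores = tf_idf(docs)
```

A term that shows up in every document ("index" here) scores zero, while terms unique to one document score highest. The scoring step never looks at the language, only at token counts, which is the point: all the language-specific work happens in the analyzer before this.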