by sixtosugarman 4466 days ago
You could use Levenshtein distance to get better results, since it takes word variations into account.

You could also improve it further by using semantic similarity scores for the strings.
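
To make the Levenshtein suggestion concrete, here is a rough Ruby sketch, not anything taken from the article. `levenshtein` is the standard dynamic-programming edit distance; `same_term?` and its `max_distance` threshold are hypothetical names made up here to show how word variations might be folded together.

```ruby
# Levenshtein (edit) distance: the minimum number of single-character
# insertions, deletions, or substitutions needed to turn a into b.
# Only two rows of the dynamic-programming table are kept in memory.
def levenshtein(a, b)
  return b.length if a.empty?
  return a.length if b.empty?

  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [
        prev[j] + 1,        # deletion
        curr[j - 1] + 1,    # insertion
        prev[j - 1] + cost  # substitution (or match)
      ].min
    end
    prev = curr
  end
  prev.last
end

# Hypothetical helper (an assumption, not from the article): treat two
# words as the same term when their edit distance is small.
def same_term?(a, b, max_distance: 1)
  levenshtein(a.downcase, b.downcase) <= max_distance
end

puts levenshtein("kitten", "sitting")
puts same_term?("color", "colour")
```

The threshold is a judgment call: set it too high and unrelated short words start to collapse into one term.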

1 comment

If you're willing to get into actual NLP, then semantic similarity would certainly be one way to go. Is there an equivalent of Stanford CoreNLP (Java) or NLTK (Python) in Ruby land? That said, I'm not sure Levenshtein will necessarily get you better results than the bag-of-words approach the author is taking with Jaccard distance, if all you're doing is document classification.
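
For comparison, a minimal sketch of Jaccard distance over bags of words, assuming a naive lowercase alphanumeric tokenizer. The author's actual tokenization and code may well differ; `tokens` and `jaccard_distance` are just illustrative names.

```ruby
require 'set'

# Assumed tokenizer: lowercase the text and keep runs of ASCII letters
# and digits as words. A real classifier might also stem or drop stopwords.
def tokens(text)
  text.downcase.scan(/[a-z0-9]+/).to_set
end

# Jaccard distance = 1 - |A ∩ B| / |A ∪ B| over the two word sets.
# 0.0 means identical vocabularies, 1.0 means no words in common.
def jaccard_distance(a, b)
  ta = tokens(a)
  tb = tokens(b)
  union = ta | tb
  return 0.0 if union.empty? # two empty documents are treated as identical

  1.0 - (ta & tb).size.to_f / union.size
end

puts jaccard_distance("the quick brown fox", "the quick red fox")
```

This only compares which words occur, which is why character-level edits inside a word don't matter to it unless they change the token.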
As for NLP libraries in Ruby land, there are both [treat](https://github.com/louismullie/treat) and [Ruby bindings to Stanford CoreNLP](https://github.com/louismullie/stanford-core-nlp).
I've used OpenNLP with JRuby for my NLP experiment. Check out https://github.com/otobrglez/politiki-ner to get an idea of how to combine them.