by gilesc 5589 days ago
The best toolkits are probably in Java:

-Stanford's Tagger, Parser, and CoreNLP

-Apache OpenNLP

-LingPipe

Many smaller components are built to be compatible with IBM's UIMA (of Watson fame), so they can be integrated into a pipeline fairly easily. For examples of this in biomedical text mining, see http://u-compare.org/ .

People will kill me for saying this, but truly: Python's performance isn't adequate for large-scale text mining, _especially_ if you want to do deep/full parsing. Shallow parsing as shown in this package's demo is more feasible.
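To make "shallow parsing" concrete, here's a rough sketch of noun-phrase chunking over tokens that have already been POS-tagged. It's plain standard-library Python, not the linked package's API or NLTK's; the sample sentence, the `NP_PATTERN` grammar, and the `chunk_noun_phrases` name are all made up for illustration. The point is that a chunker only scans the flat tag sequence for patterns, which is why it's so much cheaper than building a full parse tree.

```python
import re

# Hypothetical pre-tagged input: (token, Penn Treebank POS tag) pairs.
TAGGED = [
    ("The", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumped", "VBD"),
    ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dogs", "NNS"),
]

# Illustrative NP grammar: optional determiner, any adjectives, one or more nouns.
# The lookbehind forces a match to start on a tag boundary (so "PDT" is not read as "DT").
NP_PATTERN = re.compile(r"(?<!\S)(?:DT )?(?:JJ[RS]? )*(?:NNP?S? )+")

def chunk_noun_phrases(tagged):
    """Return noun-phrase chunks as lists of tokens, in sentence order."""
    # Encode the tags as one string, each tag followed by a single space.
    tag_string = "".join(tag + " " for _, tag in tagged)
    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        # Map character offsets back to token indices by counting spaces.
        start = tag_string.count(" ", 0, match.start())
        end = tag_string.count(" ", 0, match.end())
        chunks.append([token for token, _ in tagged[start:end]])
    return chunks

if __name__ == "__main__":
    for chunk in chunk_noun_phrases(TAGGED):
        print(" ".join(chunk))
```

This is a single linear pass over the tags with no recursion or ambiguity resolution. A deep parser has to consider many competing tree structures per sentence, and that's where the cost piles up.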

I personally find NLTK convoluted, but in its favor, it does have readers for a TON of corpora, which is really nice.

1 comment

My friends in the natural language field tell me Python and NLTK are more common than Java. Then again, this is at a sort-of Python-centric university (Toronto).