by writeslowly 3693 days ago
I've been experimenting with Stanford's CoreNLP to identify named entities for analyzing RSS feeds, and I was really impressed by how well it worked, having known nothing about the state of NLP research before I started. I was especially impressed by things like its ability to identify coreferences.

I was actually pretty disappointed with the NER in CoreNLP - I fed a few articles (including this one) into it, and while it's impressive that a computer can do this at all, it's still pretty far from something you could build a usable product on. It seems to over-recognize Persons, for example - Parsey McParseFace was tagged as a person, as were Alice and Bob, as was Tesla (in another article), and while all of these are understandable, they weren't the intended meanings in the articles. I was also pretty disappointed with the date parser: while it gets some tricky ones like "Today" and "7 hours ago", it misses very common abbreviations like "7m" or "7min" or even "7min ago".
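
To make the abbreviation cases above concrete, here is a minimal Python sketch of a fallback parser for "7m", "7min" and "7min ago". It is not how CoreNLP or SUTime work; the unit table and the `parse_relative` name are my own assumptions, and reading a bare "m" as minutes (rather than months) is a guess that fits feed timestamps.

```python
import re
from datetime import datetime, timedelta

# Hypothetical unit table (an assumption, not taken from any library).
# A bare "m" is read as minutes here; it could just as well mean months.
_UNITS = {
    "s": "seconds", "sec": "seconds", "secs": "seconds",
    "m": "minutes", "min": "minutes", "mins": "minutes",
    "minute": "minutes", "minutes": "minutes",
    "h": "hours", "hr": "hours", "hrs": "hours",
    "hour": "hours", "hours": "hours",
    "d": "days", "day": "days", "days": "days",
}

# A number, an optional space, a unit word, and an optional trailing "ago".
_RELATIVE = re.compile(r"^\s*(\d+)\s*([a-z]+)(?:\s+ago)?\s*$", re.IGNORECASE)


def parse_relative(text, now):
    """Return the datetime `text` points to, counted back from `now`, or None."""
    match = _RELATIVE.match(text)
    if match is None:
        return None
    unit = _UNITS.get(match.group(2).lower())
    if unit is None:
        return None
    return now - timedelta(**{unit: int(match.group(1))})
```

Anything it returns `None` for (such as "Today") would still go to the full date parser, so this only covers the gap described above.
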
The state of NLP tools is generally much less advanced than most people think. People assume these problems are much easier than they are.

For the date parser you want http://nlp.stanford.edu/software/sutime.html

The code and rules aren't fun to customize though.

Yeah, I looked at SuTime, but it fell down on many common cases (the CoreNLP online demo is actually integrating SuTime into the annotations it produces).

Another option is Natty [1], but it also seems to fail on the same examples. Natty at least has an ANTLR grammar that's reasonably easy to understand, though.

[1] http://natty.joestelmach.com/

I know of one large group that switched from TIMEN[1] to HeidelTime[2] because of its multi-language support.

One day someone will build a neural net model to do this rather than relying on hand-written rules.

[1] https://github.com/leondz/timen

[2] https://github.com/HeidelTime/heideltime

Thanks 'nl, 'nostrademons and 'rcpt for the links! I've been using Chronicity[0] in my project, and I hand-hacked a Polish-to-English regexp "translator" to make it work with Polish[1]. I'll be looking at the sources of the libraries you provided as well as the papers they reference; maybe I'll manage to steal some code :).

[0] - https://github.com/chaitanyagupta/chronicity

[1] - it's surprising how easy it is to get 80% of the way there with hacks like these: https://github.com/TeMPOraL/alice/blob/master/language.lisp#...
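
For anyone curious what a regexp "translator" of this kind can look like, here is a minimal Python sketch. It is not the code from the linked language.lisp; the rule table and the `polish_to_english` name are hypothetical, and it only covers a handful of phrases. The idea is to rewrite Polish date words into English so that an English-only date parser can take over.

```python
import re

# Hypothetical rule table (an assumption, not copied from the linked file).
# Each rule is (pattern, replacement); rules are applied in order.
_RULES = [
    (r"\bpojutrze\b", "the day after tomorrow"),
    (r"\bjutro\b", "tomorrow"),
    (r"\bwczoraj\b", "yesterday"),
    (r"\b(?:dzisiaj|dziś)\b", "today"),
    # "za N godzin" means "in N hours"; "godzinę" is the singular form.
    (r"\bza\s+(\d+)\s+godzinę\b", r"in \1 hour"),
    (r"\bza\s+(\d+)\s+godzin(?:y)?\b", r"in \1 hours"),
    # Same pattern for minutes.
    (r"\bza\s+(\d+)\s+minutę\b", r"in \1 minute"),
    (r"\bza\s+(\d+)\s+minut(?:y)?\b", r"in \1 minutes"),
]


def polish_to_english(text):
    """Rewrite a few Polish date phrases into English; leave the rest as-is."""
    result = text.lower()
    for pattern, replacement in _RULES:
        result = re.sub(pattern, replacement, result)
    return result
```

The output of something like `polish_to_english("za 2 godziny")` would then be handed to the English date parser unchanged. Anything the rules don't know passes through untouched, which is where the remaining 20% lives.
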

HeidelTime is the best date tagger I've used. It handles multiple languages better than anything else and fits inside UIMA.

https://github.com/HeidelTime/heideltime

Yes, I've used that before. I'm currently using textacy for Python, which is also really good. However, extracting the named entities from a sentence is still a ways off from determining what the subject of the sentence is, although it gives a good indication. Using NER + quality POS tagging and tree building should do the trick for me, I think.
Are you using BOW for sentiment analysis? Also, have you tried tinkering with Watson's sentiment analyzer?

I'm working on a project that analyzes sentiment from speech, and I've been meaning to start on text sentiment analysis, but I'm not sure where to start.

I'm using VADER - https://github.com/cjhutto/vaderSentiment - because it was evaluated on NYT editorial data (among other datasets), which makes it a reasonable fit for news sentiment parsing.

The code is pretty readable, but it relies heavily on a ruleset which might need to be tweaked for one's needs.
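
To give a feel for what "relies heavily on a ruleset" means, here is a toy Python sketch of the lexicon-plus-rules idea. It is not VADER's code or API: the word scores, the negation and intensifier tables, and the `score` function are all made up for illustration, and the real tool has a far larger lexicon and more rules.

```python
# Toy tables with made-up values; NOT VADER's actual lexicon or weights.
LEXICON = {"good": 2.0, "great": 3.0, "bad": -2.0, "terrible": -3.0}
NEGATIONS = {"not", "never", "no"}
INTENSIFIERS = {"very": 1.5, "extremely": 2.0}


def score(sentence):
    """Sum word scores, applying two hand-written rules to each scored word."""
    words = [token.strip(".,!?") for token in sentence.lower().split()]
    total = 0.0
    for i, word in enumerate(words):
        if word not in LEXICON:
            continue
        value = LEXICON[word]
        prev = words[i - 1] if i >= 1 else ""
        prev2 = words[i - 2] if i >= 2 else ""
        # Rule 1: an intensifier right before the word scales its score.
        if prev in INTENSIFIERS:
            value *= INTENSIFIERS[prev]
            negator = prev2
        else:
            negator = prev
        # Rule 2: a negation before the word (or its intensifier) flips the sign.
        if negator in NEGATIONS:
            value = -value
        total += value
    return total
```

Tweaking for a domain then mostly means editing the tables: adding words that carry sentiment in news copy, or adjusting scores for words that mean something different there.
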