Hacker News
by nostrademons 3695 days ago
I was actually pretty disappointed with the NER in CoreNLP. I fed a few articles (including this one) into it, and while it's impressive that a computer can do this at all, it's still pretty far from something you could build a usable product on. It seems to over-recognize Persons, for example: Parsey McParseFace was tagged as a person, as were Alice and Bob, as was Tesla (in another article). All of these are understandable, but they weren't the intended meanings in the articles. I was also pretty disappointed with the date parser: while it gets some tricky ones like "Today" and "7 hours ago", it misses very common abbreviations like 7m or 7min or even "7min ago".
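For what it's worth, the abbreviated forms mentioned above ("7m", "7min", "7min ago") can be covered by a few lines of hand-written rules. This is only a minimal sketch and not how CoreNLP or SUTime work: the unit table and the `parse_relative` helper are hypothetical, "m" is assumed to mean minutes (not months), and a bare "7m" is assumed to point into the past, as in a comment timestamp.

```python
import re
from datetime import datetime, timedelta

# Hypothetical unit table: maps abbreviations to timedelta keyword names.
# Assumption: "m" means minutes, never months.
_UNITS = {
    "s": "seconds", "sec": "seconds", "secs": "seconds",
    "second": "seconds", "seconds": "seconds",
    "m": "minutes", "min": "minutes", "mins": "minutes",
    "minute": "minutes", "minutes": "minutes",
    "h": "hours", "hr": "hours", "hrs": "hours",
    "hour": "hours", "hours": "hours",
    "d": "days", "day": "days", "days": "days",
}

# A number, an optional space, a unit word, and an optional trailing "ago".
_RELATIVE = re.compile(r"^\s*(\d+)\s*([a-z]+)(?:\s+ago)?\s*$", re.IGNORECASE)


def parse_relative(text, now):
    """Return the datetime `text` refers to relative to `now`, or None."""
    match = _RELATIVE.match(text)
    if match is None:
        return None
    amount = int(match.group(1))
    unit = match.group(2).lower()
    if unit not in _UNITS:
        return None
    # Assumption: with or without "ago", the expression refers to the past.
    return now - timedelta(**{_UNITS[unit]: amount})


if __name__ == "__main__":
    now = datetime(2016, 5, 13, 12, 0, 0)
    for sample in ["7m", "7min", "7min ago", "7 hours ago", "Tesla"]:
        print(repr(sample), "->", parse_relative(sample, now))
```

Anything the pattern does not recognize comes back as `None`, so a rule like this can sit in front of a fuller parser and only handle the abbreviations it misses.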

The state of NLP tooling is generally much worse than most people think. People assume the problem is much easier than it is.

For date parsing you want SUTime: http://nlp.stanford.edu/software/sutime.html

The code and rules aren't fun to customize though.

Yeah, I looked at SUTime, but it fell down on many common cases (the CoreNLP online demo already integrates SUTime into the annotations it produces).

Another option is Natty [1], but it also seems to fail on the same examples. Natty at least has an ANTLR grammar that's reasonably easy to understand, though.

[1] http://natty.joestelmach.com/

I know of one large group that switched from Timen [1] to HeidelTime [2] because of its multi-language support.

One day someone will build a neural net model to do this rather than relying on hand-written rules.

[1] https://github.com/leondz/timen

[2] https://github.com/HeidelTime/heideltime

Thanks 'nl, 'nostrademons and 'rcpt for the links! I've been using Chronicity[0] in my project, and I hand-hacked a Polish-to-English regexp "translator" to make it work with Polish[1]. I'll be looking at the sources of the libraries you provided as well as the papers they reference; maybe I'll manage to steal some code :).

[0] - https://github.com/chaitanyagupta/chronicity

[1] - it's surprising how easy it is to get 80% of the way there with hacks like these: https://github.com/TeMPOraL/alice/blob/master/language.lisp#...
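To illustrate the general idea of a regexp "translator" (this is not the code from the linked language.lisp, just a rough sketch of the technique in Python): rewrite Polish date phrases into English with an ordered list of substitutions, then hand the result to an English-only date parser. The rule list and the `translate` helper below are hypothetical and cover only a handful of phrases.

```python
import re

# Hypothetical rewrite rules, applied in order to lower-cased input.
# Each pair is (compiled Polish pattern, English replacement).
_RULES = [
    (re.compile(r"\bjutro\b"), "tomorrow"),
    (re.compile(r"\bwczoraj\b"), "yesterday"),
    (re.compile(r"\b(?:dzisiaj|dziś)\b"), "today"),
    # Polish noun endings vary with the number, so accept the common ones.
    (re.compile(r"\bza (\d+) minut[yę]?\b"), r"in \1 minutes"),
    (re.compile(r"\bza (\d+) godzin[yę]?\b"), r"in \1 hours"),
]


def translate(text):
    """Rewrite known Polish date phrases into English; leave the rest alone."""
    text = text.lower()
    for pattern, replacement in _RULES:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    for sample in ["Jutro", "za 5 minut", "za 2 godziny", "dziś"]:
        print(repr(sample), "->", repr(translate(sample)))
```

The output is deliberately sloppy English (it always emits plural units, for instance), which is fine as long as the downstream parser is tolerant; that tolerance is what makes the 80% cheap.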

HeidelTime is the best date tagger I've used. It handles multiple languages better than anything else and fits inside UIMA.

https://github.com/HeidelTime/heideltime