by tcwc 4865 days ago
The Stanford parser is great, but it isn't really the same thing. The Stanford entity recogniser is limited to the standard types (people, places, and companies), whereas we identify and disambiguate entities into a far richer ontology derived from Wikipedia, and can recognize topic abstractions that aren't explicitly mentioned.

Also we found the Stanford tools (and the other open source NLP tools) were difficult to integrate into "production" apps for various reasons. One big one was performance - we aim to run the full parsing and extraction pipeline on an average news story in a few hundred milliseconds, which can be an order of magnitude faster than the others.


How does your offering compare to Calais from Thomson-Reuters?

Edit: To be specific, it looks very similar. What do you have that Calais doesn't?

I have been using the free tier (50K API calls per day) of Open Calais for years and have also used it in code examples in three books I have written.

One thing that Open Calais does that I really like is that it tries to assign a single URI that uniquely identifies each recognized named entity. This is useful because, for example, when it recognizes President Bill Clinton, you get a reference to the same unique URI even if his name or title is written differently in different processed texts.
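To make that concrete, here is a minimal sketch of why a shared URI helps when you merge results across documents. The mention records, the field names (`doc`, `text`, `uri`), and the `example.org` URIs are all made up for illustration; this is not the actual Open Calais response format, just the general idea of keying on the URI instead of the surface text.

```python
from collections import defaultdict

# Hypothetical, hand-written mentions. The URIs are placeholders,
# not real Open Calais identifiers or output.
mentions = [
    {"doc": "story-1", "text": "President Bill Clinton",
     "uri": "http://example.org/entity/bill-clinton"},
    {"doc": "story-2", "text": "Bill Clinton",
     "uri": "http://example.org/entity/bill-clinton"},
    {"doc": "story-3", "text": "former President Clinton",
     "uri": "http://example.org/entity/bill-clinton"},
    {"doc": "story-3", "text": "Hillary Clinton",
     "uri": "http://example.org/entity/hillary-clinton"},
]


def group_by_uri(mentions):
    """Group (doc, surface text) pairs under the URI of the entity they refer to."""
    grouped = defaultdict(list)
    for mention in mentions:
        grouped[mention["uri"]].append((mention["doc"], mention["text"]))
    return dict(grouped)


entities = group_by_uri(mentions)
for uri, occurrences in entities.items():
    print(uri, len(occurrences))
```

Grouping on the surface text instead would split the three Bill Clinton mentions into three separate entities, which is exactly the problem the shared URI avoids.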

Thomson-Reuters bought ClearForest several years ago, thus acquiring Calais. If you are interested in text mining, and if you haven't experimented with Open Calais, then please put that on your TODO list.