

	by anilshanbhag 4864 days ago
	This is just like the Stanford parser: http://nlp.stanford.edu:8080/parser/ Why use TextRazor and pay for it?

3 comments

mark_l_watson 4864 days ago

The Stanford NLP tools are very good, and also GPLed, which works for a lot of projects. If the GPL doesn't work for you, the Apache OpenNLP project is also good.


mark_l_watson 4864 days ago

BTW, it is not just a matter of having software packages to use: obtaining and preparing training data is a ton of work. That said, the Stanford NLP and OpenNLP tools come out of the box with trained models for tagging, named entity recognition, etc. For lots of uses, these pre-trained models will work well for you.


tcwc 4864 days ago

The Stanford parser is great, but it isn't really the same thing. The Stanford entity recogniser is limited to the standard types (people, places, companies), whereas we identify and disambiguate entities into a far richer ontology derived from Wikipedia, and can recognize topic abstractions that aren't explicitly mentioned in the text.

Also, we found the Stanford tools (and the other open source NLP tools) difficult to integrate into "production" apps for various reasons. A big one was performance: we aim to run the full parsing and extraction pipeline on an average news story in a few hundred milliseconds, which can be an order of magnitude faster than the others.


JPKab 4864 days ago

How does your offering compare to Calais from Thomson-Reuters?

Edit: To be specific, it looks very similar. What do you have that Calais doesn't?


mark_l_watson 4864 days ago

I have been using the free tier (50K API calls per day) of Open Calais for years and have also used it in code examples in three books I have written.

One thing Open Calais does that I really like is that it attempts to assign a single URI that uniquely identifies each recognized named entity. This is useful because, for example, when it recognizes President Bill Clinton, you get a reference to the same unique URI even if his name or title appears differently across processed texts.
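To make the idea concrete, here is a minimal sketch of mapping different surface forms to one canonical URI. It is not the Open Calais API: the alias table, the example.org URIs, and the function names `canonical_uri` and `group_mentions` are all made up for illustration, and a real service would do actual disambiguation rather than a dictionary lookup.

```python
from collections import defaultdict

# Hypothetical alias table. The URIs are placeholders on example.org,
# not real Open Calais identifiers.
CLINTON = "http://example.org/entity/bill-clinton"
ALIASES = {
    "bill clinton": CLINTON,
    "president bill clinton": CLINTON,
    "president clinton": CLINTON,
    "william jefferson clinton": CLINTON,
}


def canonical_uri(mention):
    """Return the canonical URI for a mention, or None if it is unknown."""
    # Normalize case and collapse repeated whitespace before the lookup.
    key = " ".join(mention.lower().split())
    return ALIASES.get(key)


def group_mentions(mentions):
    """Group raw mentions by canonical URI, skipping unknown ones."""
    grouped = defaultdict(list)
    for mention in mentions:
        uri = canonical_uri(mention)
        if uri is not None:
            grouped[uri].append(mention)
    return dict(grouped)


if __name__ == "__main__":
    found = ["President Bill Clinton", "Bill  Clinton", "Someone Else"]
    for uri, names in group_mentions(found).items():
        print(uri, names)
```

The payoff is that mentions from different documents can be joined on the URI instead of on the raw string.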

Thomson-Reuters bought ClearForest several years ago, thus acquiring Calais. If you are interested in text mining, and if you haven't experimented with Open Calais, then please put that on your TODO list.


gsharma 4864 days ago

Just tried http://nlp.stanford.edu:8080/parser/ and it only allows up to 70 characters for parsing.


mark_l_watson 4864 days ago

Try downloading the distribution of code and data and running it locally. It's Java and Maven-based, so it's easy to run. Use the example code listed on the installation web page to see how to set it up.


gsharma 4864 days ago

Ahh, thanks. I didn't realize it was just a demo. Sounds like a fun weekend project.
