NLUlite: Natural language parser and database (nlulite.com)
52 points by NLUlite 4303 days ago
11 comments

Let me know if you're interested in licensing better syntactic/semantic parsing technologies.

I'm the developer of the Redshift NLP library ( http://github.com/syllog1sm/redshift ). Currently the software lacks documentation, but it offers a good speed/accuracy trade-off. Documentation and a good tokenizer are coming. You can read a tutorial for a simplified version of the algorithm on my blog: http://honnibal.wordpress.com

Installer crapped out for me, giving a gzip error. Also, I'm not wild about self-extracting and executing archives. Would you mind perhaps just posting a .tgz that can be examined before installing?
Hi, can you post the error?
This appearing at the same time as a post about designing a personal knowledgebase ( https://news.ycombinator.com/item?id=8270759 ) makes me wonder how reasonable it is to link this sort of natural language fact extraction and querying system to such a knowledgebase.

I suspect it might be expecting too much, but I'd love to integrate my browser history, people I've contacted, where I've been etc. in order to produce an easier way to search and find some webpage (e.g. when I remember some contextual information about the day or the place when I visited a page but am unable to find it by re-googling)

Thank you all for your comments :-)

@unsane: This is of course not supposed to happen (we tested it on many different machines). Apologies. Please write to contact@nlulite.com and we'll look into the problem.

@garblegarble: The system is still under development. Please write to contact@nlulite.com to suggest features you would like to see.

@Syllogism: 93.6% accuracy is impressive. At this stage, however, we prefer to use proprietary algorithms. We feel we can reach similar accuracy for version 0.2.0 (out in January).

@CGamesPlay: The server is supposed to be installed in the $HOME directory. If you wish to use a different path, you can use the option -d <YOUR_NEW_PATH> when starting the server.

@Rhapso: You are right, the non-commercial download is somewhat byzantine. The problem with wget is that you don't get to sign a non-commercial agreement. Let us think about it for a few nights.

@toblender: We are working on that ;-)

You can refer to the statement in comments at the top (hell, put the whole thing in the script, print it out when run, and make the user type yes)
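A rough sketch of what that could look like, written in Python purely for illustration. The real installer is a self-extracting archive, and `confirm_license` and its parameters are made-up names, not anything NLUlite ships:

```python
def confirm_license(license_text, ask=input, out=print):
    """Show the license and return True only if the user types "yes".

    `ask` and `out` default to the built-in input/print; they are
    parameters only so the prompt can be exercised without a terminal.
    """
    out(license_text)
    answer = ask('Type "yes" to accept the non-commercial license: ')
    # Accept "yes" regardless of case or surrounding whitespace.
    return answer.strip().lower() == "yes"
```

An install script would call `confirm_license(...)` first and exit without unpacking anything if it returns False, so a wget download still goes through the agreement.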
Thank you for your suggestion. We are going to implement the wget option next week.
Whenever I see "Natural Language Parser" mentioned anywhere I get excited then a little disappointed because it implies something much more profound.

Not to belittle the tremendous effort, but most projects I have seen are "English Language Parser"s.

Are there any actual generic language parsing projects out there?

That don't try to overfit to English but actually attempt to do a job of whatever quality in whatever language?

Like I'm a native English speaker, I can understand English say 100%, Japanese 80-90%, I can understand a bit of a few European languages and I can identify a bunch of other languages.

It would be wonderful if there were software with this design in mind.

> Are there any actual generic language parsing projects out there?

Chalmers University has impressive results on this - http://www.grammaticalframework.org/

That's not what's meant by Natural Language Parsers. Most are trained on datasets in English, but given the right training they can work in other languages. NLP means a bunch of different things from POS tagging to dependency parsing. What this project is doing is semantic parsing.
For some URLs, add_url throws an exception, for example:

http://en.wikipedia.org/wiki/Horse (I don't like snakes)

  File "/home/drace/dev/NLUlite/client_python/NLUlite.py", line 375, in add_url
    parser.feed(page)
  File "/usr/lib/python2.7/HTMLParser.py", line 114, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/HTMLParser.py", line 158, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.7/HTMLParser.py", line 305, in parse_starttag
    attrvalue = self.unescape(attrvalue)
  File "/usr/lib/python2.7/HTMLParser.py", line 472, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 8: ordinal not in range(128)
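
One plausible reading of that traceback: the page is handed to the parser as raw bytes, and Python 2's HTMLParser trips when an attribute value contains both an entity and a non-ASCII byte (0xe2 is the lead byte of UTF-8 characters such as the en dash and curly quotes). Decoding the page to text before calling feed() usually avoids this. Below is a minimal sketch in Python 3; `TextCollector`, `decode_page` and `extract_text` are hypothetical names, not part of the NLUlite client, and UTF-8 is an assumed default charset:

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text nodes and attribute values from an HTML page."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.attr_values = []

    def handle_starttag(self, tag, attrs):
        # Attribute values arrive already unescaped (e.g. &amp; -> &).
        for _name, value in attrs:
            if value is not None:
                self.attr_values.append(value)

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def decode_page(raw, charset="utf-8"):
    """Turn downloaded bytes into text; undecodable bytes become U+FFFD."""
    return raw.decode(charset, errors="replace")


def extract_text(raw, charset="utf-8"):
    """Decode first, then parse, so the parser only ever sees text."""
    parser = TextCollector()
    parser.feed(decode_page(raw, charset))
    parser.close()
    return parser.chunks, parser.attr_values
```

In practice the charset would come from the Content-Type header or a meta tag rather than being hard-coded.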

It's also really slow at learning. I have plenty of everything (cores, memory, etc.) and it still takes minutes to process web pages. I guess you do say on the website that the free version is slow.

By the way, the commercial version's parser scales almost linearly with the number of (independent) threads. The Wisdom.ask() method is also faster with the multithreaded version.
Thanks for the feedback, we are looking into it.
Please set up a better way to distribute the non-commercial version. If you insist on using the self-extracting archive, please make it accessible via wget. If this works half as well as it claims, I am willing to pay you for a commercial single-license for personal use.
When I start the server and attempt to instantiate a ServerProxy, I get "Connection refused". The server produces no output. Running Ubuntu 14.04.

[append] Turns out the server will silently do nothing if you do not extract the archive to $HOME.

You just have to use -d to specify where the data files are; the server can be anywhere. I moved it out of home to my dev environment without problems.
Unfortunately it's only for Linux at this time.
It's really simple to get an x64 Linux VM up and running; VirtualBox takes just a few minutes to spin up. I think Linux makes the most sense. It doesn't really make sense for the developer to waste cycles on porting when people can just use a VM to run it very easily.
How does this compare to NLTK, OpenNLP or Mahout?
You need to try all four :-) More seriously: NLUlite is supposed to work "out of the box", without any need of additional training datasets for the parser (often these datasets can be quite expensive). Mahout is a different type of machine learning, as it does not look into the grammar of sentences.
Can you dump out the index and examine whether the most important noun phrases and named entities are being extracted, a la Lucene?
Exactly as hnriot said: the saved Wisdoms are in plain XML, where the sentences are represented in DRT (http://bit.ly/1AhuVas)
When you 'teach' it, the wisdom files that are saved are in plain old XML.