Let me know if you're interested in licensing better syntactic/semantic parsing technologies.
I'm the developer of the Redshift NLP library ( http://github.com/syllog1sm/redshift ). Currently the software lacks documentation, but it offers a good speed/accuracy trade-off. Documentation and a good tokenizer are coming. You can read a tutorial for a simplified version of the algorithm on my blog: http://honnibal.wordpress.com
Installer crapped out for me, giving a gzip error. Also, I'm not wild about self-extracting and executing archives. Would you mind perhaps just posting a .tgz that can be examined before installing?
This appearing at the same time as a post about designing a personal knowledgebase ( https://news.ycombinator.com/item?id=8270759 ) makes me wonder how reasonable it is to link this sort of natural language fact extraction and querying system to such a knowledgebase.
I suspect it might be expecting too much, but I'd love to integrate my browser history, people I've contacted, where I've been, etc. in order to produce an easier way to search for and find some webpage (e.g. when I remember some contextual information about the day or the place when I visited a page but am unable to find it by re-googling).
@unsane: This is of course not supposed to happen (we tested it on many different machines). Apologies. Please write to contact@nlulite.com and we'll look into the problem.
@garblegarble: The system is still under development. Please write to contact@nlulite.com to suggest features you would like to see.
@Syllogism: 93.6% accuracy is impressive. At this stage, however, we prefer to use proprietary algorithms. We feel we can reach similar accuracy for version 0.2.0 (out in January).
@CGamesPlay: The server is supposed to be installed in the $HOME directory. If you wish to use a different path, you can use the option -d <YOUR_NEW_PATH> when starting the server.
@Rhapso: You are right, the non-commercial download is somewhat byzantine. The problem with wget is that you don't get to sign a non-commercial agreement. Let us think about it for a few nights.
Whenever I see "Natural Language Parser" mentioned anywhere I get excited then a little disappointed because it implies something much more profound.
Not to belittle the tremendous effort, but most projects I have seen are "English Language Parser"s.
Are there any actual generic language parsing projects out there?
That don't try to overfit to English but actually attempt to do a job of whatever quality in whatever language?
Like I'm a native English speaker, I can understand English say 100%, Japanese 80-90%, I can understand a bit of a few European languages and I can identify a bunch of other languages.
It would be wonderful if there were software with this design in mind.
That's not what's meant by Natural Language Parsers. Most are trained on datasets in English, but given the right training they can work in other languages. NLP means a bunch of different things from POS tagging to dependency parsing. What this project is doing is semantic parsing.
  File "/home/drace/dev/NLUlite/client_python/NLUlite.py", line 375, in add_url
    parser.feed(page)
  File "/usr/lib/python2.7/HTMLParser.py", line 114, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/HTMLParser.py", line 158, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.7/HTMLParser.py", line 305, in parse_starttag
    attrvalue = self.unescape(attrvalue)
  File "/usr/lib/python2.7/HTMLParser.py", line 472, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 8: ordinal not in range(128)
It's also really slow at learning. I have a ton of everything (cores, memory, etc.) and it still takes minutes to process web pages. I guess you do say on the website that the free version is slow.
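On the UnicodeDecodeError in the traceback above: byte 0xe2 is typically the first byte of a UTF-8 encoded curly quote or dash, so my guess is that the page is being fed to the HTML parser as raw bytes and an attribute value contains both such a character and an entity like &amp;. I haven't looked at what add_url does internally, so treat this as a sketch of the usual workaround rather than a patch for NLUlite.py: decode the page to text before calling feed(). It is written for Python 3's html.parser (the traceback is from Python 2.7), and TextCollector and parse_page are hypothetical names of my own.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: records attribute values and text nodes."""

    def __init__(self):
        super().__init__()
        self.attr_values = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Attribute values arrive already unescaped (&amp; -> &).
        self.attr_values.extend(v for _, v in attrs if v is not None)

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def parse_page(raw_bytes, encoding="utf-8"):
    # Decode explicitly instead of relying on an implicit ASCII decode.
    # The encoding is an assumption; a real client would read it from
    # the HTTP headers or a <meta> tag. Undecodable bytes are replaced.
    text = raw_bytes.decode(encoding, errors="replace")
    parser = TextCollector()
    parser.feed(text)
    parser.close()
    return parser


# A page whose attribute mixes a curly apostrophe (UTF-8: e2 80 99)
# with an entity reference, similar to what seems to trigger the error.
page = '<a title="It\u2019s &amp; more">caf\u00e9</a>'.encode("utf-8")
result = parse_page(page)
print(result.attr_values, result.chunks)
```

Until something like that is in the client, stripping or decoding non-ASCII characters from the page yourself before handing it over might get you past the error.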
By the way, the commercial version's parser scales almost linearly with the number of (independent) threads. The Wisdom.ask() method is also faster with the multithreaded version.
Please set up a better way to distribute the non-commercial version. If you insist on using the self-extracting archive, please make it accessible via wget. If this works half as well as it claims, I am willing to pay you for a commercial single-user license for personal use.
You just have to use -d to specify where the data files are; the server can be anywhere. I moved it out of home to my dev environment without problems.
It's really simple to get an x64 Linux VM up and running; VirtualBox takes just a few minutes to spin up. I think Linux makes the most sense. It doesn't really make sense for the developer to waste cycles on porting when people can just use a VM to run it very easily.
You need to try all four :-) More seriously: NLUlite is supposed to work "out of the box", without any need for additional training datasets for the parser (these datasets can often be quite expensive).
Mahout is a different type of machine learning, as it does not look into the grammar of sentences.