RL3: a rule-based information-extraction and entity-recognition engine

	RL3: a rule-based information-extraction and entity-recognition engine (rl3.zorallabs.com)
	58 points by jo_kruger 2818 days ago

4 comments

wener 2815 days ago

Not free https://rl3.zorallabs.com/wiki/RL3_License

philprx 2815 days ago

Yes, it's a bit sad to see that in a project that would otherwise be interesting.

rl3 2815 days ago

I was looking forward to chiming in as this project's unofficial mascot, but it seems they'll have to pay me now. :)

jo_kruger 2815 days ago

It is free for non-commercial (personal, research and educational) use.

infocollector 2815 days ago

Seems to be as complicated as regex to me. Perhaps I am missing something? Any other libraries like this that are decent at this and simplify the problem?

tyingq 2815 days ago

I don't know if any are simpler, but here's a Stack Overflow question that points to other similar tools: https://stackoverflow.com/questions/17891932/open-source-rul...

jo_kruger 2815 days ago

We started RL3 (more than 10 years ago) because we had several projects with a huge number of patterns. We found that the other projects available at the time were too heavy on syntax, which made it complicated to support and manage a large library of patterns. So we tried to keep the power of regex and add new features (like named patterns, modules, templates and lookup dictionaries) while minimizing additional syntax. As a result, we were able to enable a team of computational linguists (i.e. not programmers) to develop and support huge libraries of NER patterns and document classification rules quite easily.

kvakernaak 2814 days ago

Are there any public projects based on this engine?

jo_kruger 2814 days ago

Yes. The most notable are https://www.aihitdata.com and https://www.happygrumpy.com -- the first crawls corporate websites (~25 million) and extracts key information such as people, contacts, etc.; the second is a sentiment analysis tool.

kvakernaak 2813 days ago

Thanks. Consider expanding the docs with more examples of how to extract different types of structured data.

jo_kruger 2815 days ago

It is actually based on regex (at a low level). The key difference is that it allows named patterns, modules, templates, and a kind of classes. So you may have a separate module with definitions of related patterns -- for instance, a module for date and time patterns, a module for location patterns, etc. Then you can include the required module in your project and refer to these patterns (by name) from your own patterns. For example, "{PERSON_NAME}[,:\s]?{JOB_TITLE}" may be a pattern which matches a person name followed by optional punctuation and then by a job title, where PERSON_NAME and JOB_TITLE are quite complicated patterns defined somewhere else.
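
A minimal Python sketch of the named-pattern idea, for readers who want something concrete. It is not RL3 syntax and not how RL3 is implemented: the `PATTERNS` table, the `expand` helper, and the toy definitions of PERSON_NAME and JOB_TITLE are all assumptions made for illustration. The composed pattern is a slight variation of the one quoted above, so that a comma followed by a space also matches.

```python
import re

# Hypothetical pattern library. In a real project the building blocks
# would live in separate modules and be far more elaborate.
PATTERNS = {
    "PERSON_NAME": r"[A-Z][a-z]+ [A-Z][a-z]+",
    "JOB_TITLE": r"(?:CEO|CTO|Director|Engineer)",
    # Variation of the quoted pattern: optional punctuation, optional spaces.
    "PERSON_WITH_TITLE": r"{PERSON_NAME}[,:]?\s*{JOB_TITLE}",
}

# A reference looks like {NAME}. Regex quantifiers such as {2,3} contain
# digits or commas only, so they are not mistaken for references.
REFERENCE = re.compile(r"\{([A-Z_][A-Z0-9_]*)\}")


def expand(name, seen=()):
    """Recursively replace {NAME} references with the referenced pattern."""
    if name in seen:
        raise ValueError(f"circular pattern reference: {name}")
    body = PATTERNS[name]
    return REFERENCE.sub(
        lambda m: "(?:" + expand(m.group(1), seen + (name,)) + ")",
        body,
    )


person_with_title = re.compile(expand("PERSON_WITH_TITLE"))
match = person_with_title.search("Contact: Jane Doe, CTO of Example Corp")
print(match.group(0) if match else None)
```

The point of the design is that the top-level rule stays short and readable while the complexity lives in the named building blocks.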

Another key difference is that it supports huge lookup dictionaries, i.e. part of your pattern may be a reference to a dictionary which may contain thousands or even millions of entries. So PERSON_NAME can be a dictionary.
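
To illustrate how a dictionary can act as a matcher, here is a rough Python sketch using a plain trie with longest-match lookup. This is an assumption-laden stand-in, not RL3's dictionary implementation: `TrieDictionary` and `find_all` are hypothetical names, and the sketch does no word-boundary checking.

```python
class TrieDictionary:
    """Toy lookup dictionary backed by a nested-dict trie."""

    def __init__(self, entries):
        self.root = {}
        for entry in entries:
            node = self.root
            for ch in entry:
                node = node.setdefault(ch, {})
            node[""] = True  # empty key marks the end of an entry

    def longest_match(self, text, start):
        """Return the end index of the longest entry starting at `start`, or None."""
        node = self.root
        end = None
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if "" in node:
                end = i + 1
        return end


def find_all(dictionary, text):
    """Scan left to right, collecting non-overlapping longest matches."""
    hits = []
    i = 0
    while i < len(text):
        end = dictionary.longest_match(text, i)
        if end is None:
            i += 1
        else:
            hits.append(text[i:end])
            i = end
    return hits


names = TrieDictionary(["Ann", "Anna Lee", "Bob"])
print(find_all(names, "Anna Lee met Bob and Ann"))
```

Lookup cost depends on the length of the matched text rather than on the number of entries, which is why this kind of structure scales to very large dictionaries.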

...

jo_kruger 2815 days ago

Refer to the email example (it demonstrates some features on a simple task): https://rl3.zorallabs.com/wiki/Extract_Email_Addresses_From_...

tensor 2815 days ago

https://gate.ac.uk is one of the most extensive and widely used. Note that rule-based techniques are generally not as effective as machine-learning-based techniques, though. GATE actually supports both ML and rules.

PaulHoule 2815 days ago

Funny.

I went to a talk by Edward Snowden's boss the week before Edward went rogue and got an introduction to the kind of text analysis system used by people who work for three-letter agencies in places like Maryland and Virginia.

It was a product from BBN that looked a lot like RL3 but was much more complex and able to handle a set of 50,000 or more rules, on the assumption that 50 people are going to write 1,000 rules in a week, as opposed to one expert writing 50 a day and taking 3 years to write that many rules.

This is an "expert system" which has to perform well in some particular domain, so it might be analyzing police reports which have a certain structure, setting and cast of characters (or medical reports, or corporate filings, or what airline customers said about their flight).

It is manageable to make a rule-based system that performs very well on specific extraction tasks, and even to prove with statistical bounds that it works up to a certain point. The metrics are different, but I believe those systems would be laughed out of court if they couldn't be tuned up for 0.95 or better precision in some cases.
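
As a rough illustration of what "precision with statistical bounds" can look like in practice, the sketch below computes precision on a hand-checked sample of extractions together with a Wilson score lower bound. The choice of the Wilson interval, the function names, and the sample counts are assumptions for the example; the comment does not say which bound those systems actually use.

```python
import math


def precision(true_positives, predicted_positives):
    """Fraction of extracted items that were judged correct."""
    if predicted_positives <= 0:
        raise ValueError("need at least one predicted positive")
    return true_positives / predicted_positives


def wilson_lower_bound(true_positives, predicted_positives, z=1.96):
    """Lower end of the Wilson score interval (z=1.96 is roughly 95% two-sided)."""
    n = predicted_positives
    p = precision(true_positives, n)
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)


# Hypothetical audit: 1,000 extractions reviewed by hand, 970 judged correct.
checked, correct = 1000, 970
print(f"precision   = {precision(correct, checked):.3f}")
print(f"lower bound = {wilson_lower_bound(correct, checked):.3f}")
```

With a sample this size, the lower bound sits only slightly below the observed precision; with a few dozen reviewed items it would be much looser, which is the practical reason audits of this kind need large samples.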

It's not cheap to write 50,000 rules, but making training sets for machine learning is a lot of work too. I sometimes get good classifiers (better than Mechanical Turk workers) with 2,000-5,000 samples, and some friends of mine in South America would expect you to provide about 20,000 samples for a simple data extraction pipeline.

Unfortunately I can't point you to references in the academic literature because the data sets are proprietary. NLP engineers look all day at metrics that they can't show you. That is too bad, because you get a distorted picture of machine learning if you are doing Yann LeCun's digits: that is a carefully built 'toy' problem which has had obstacles cleared away for you, and it is just big enough that the result looks miraculous.

jo_kruger 2815 days ago

This was actually one of the reasons we started RL3 -- to make rule development less expensive. Note, we started it more than 10 years ago as a purely internal product, and just several months ago we decided to make it public (currently it is free for personal, research and educational use). In the end, we were able to enable a team of linguists (i.e. we actually searched for linguists, not even computational linguists) to write and support a large library of NER patterns. Partially, this was possible due to named patterns -- i.e. we were able to have a small team of expensive resources (programmers and pro comp. linguists) develop a sort of core library of patterns, and then build a larger library of patterns on top of this core...

yazaddaruvala 2814 days ago

> like RL3 but much more complex and able to handle a set of 50,000 or more rules

jo_kruger, the website is relatively light on the limitations of RL3. For example, I'm curious: what is the max number of rules it can generally handle?

jo_kruger 2814 days ago

Sorry, we are still working on the website. The project is quite mature (more than 10 years old), but it was mostly for internal use. We only recently decided to make it public, which is not always easy to do, so we still have a lot to do on documentation, etc.

Regarding your question -- there is no limit on the number of rules. But I should say the definition of "number of rules" may be tricky. I don't see any reason to have a lot of high-level rules (i.e. annotators and asserts): in most cases the number of annotators will be N times bigger than the number of entity classes, or categories (in the case of a categorization task). On the other hand, there may be many more low-level rules (i.e. patterns and predicates used to form the high-level rules).

Also, the built-in DAWG dictionaries may help a lot: these dictionaries can handle millions of entries, behave the same as other matchers (i.e. they can be used in the same way as other patterns and regex matchers), and work much faster than patterns. For instance, in RL3 you can define the pattern \<{PERSON_FIRSTNAME},?\s{PERSON_LASTNAME}\> (which matches a first name followed by an optional comma, a space and a last name), where {PERSON_FIRSTNAME} and {PERSON_LASTNAME} are dictionaries.
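
A hedged Python approximation of that first-name/last-name pattern: a regex supplies the structure (word, optional comma, whitespace, word) and two small sets stand in for the dictionaries. The names `FIRST_NAMES`, `LAST_NAMES` and `find_full_names` are hypothetical, the case-insensitive membership test is an assumption, and real DAWG-backed dictionaries would be matched inside the engine rather than filtered afterwards like this.

```python
import re

# Stand-ins for {PERSON_FIRSTNAME} and {PERSON_LASTNAME}; real dictionaries
# could hold millions of entries.
FIRST_NAMES = {"jane", "john"}
LAST_NAMES = {"doe", "smith"}

WORD = re.compile(r"\b[A-Za-z]+")
# Structural part of the pattern: word, optional comma, one whitespace, word.
PAIR = re.compile(r"([A-Za-z]+),?\s([A-Za-z]+)\b")


def find_full_names(text):
    """Try the pair pattern at every word start and keep dictionary hits."""
    found = []
    for word in WORD.finditer(text):
        m = PAIR.match(text, word.start())
        if (
            m
            and m.group(1).lower() in FIRST_NAMES
            and m.group(2).lower() in LAST_NAMES
        ):
            found.append(m.group(0))
    return found


print(find_full_names("Yesterday we met John Smith and Jane, Doe."))
```

Trying the pattern at every word start, instead of relying on non-overlapping regex matches, avoids losing "John Smith" when the preceding word pair ("met John") has already been consumed.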

PaulHoule 2814 days ago

50,000 rules is not that hard in 2018 if you have a modern RETE engine and the right indexes for lookups.

ar7hur 2815 days ago

Self-promoting our own Duckling https://duckling.wit.ai/ which has rules, but "prioritized" by ML -- and a BSD license.

ppppppaul 2815 days ago

This is a lot like Instaparse, which I think has a much better API, but it is on the JVM and may not perform as well. I haven't used it to perform something like a regex capture, but it may not be hard to do so.
