Seems to be as complicated as regex to me. Perhaps I am missing something? Any other libraries like this that are decent at this and simplify the problem?
We started RL3 (more than 10 years ago) because we had several projects with a huge number of patterns. We found that the other projects available at the time were too heavy on syntax, which made it complicated to support and manage a large library of patterns. So we tried to keep the power of regex and add new features (like named patterns, modules, templates and lookup dictionaries) while minimizing additional syntax. As a result, we were able to enable a team of computational linguists (i.e. not programmers) to develop and support huge libraries of NER patterns and document classification rules quite easily.
Yes. The most notable are https://www.aihitdata.com and https://www.happygrumpy.com
The first crawls corporate websites (~25 million) and extracts key information such as people, contacts, etc.
The second is a sentiment analysis tool.
It is actually based on regex (at a low level). But the key difference is that it allows named patterns, modules, templates, and a kind of classes. So you may have a separate module with definitions of related patterns, for instance a module for date and time patterns, a module for location patterns, etc. Then you can include the required module in your project and refer to these patterns (by name) from your own patterns. For example, "{PERSON_NAME}[,:\s]?{JOB_TITLE}" may be a pattern which matches a person name followed by optional punctuation and then by a job title, where PERSON_NAME and JOB_TITLE are quite complicated patterns defined somewhere else.
Another key difference: it supports huge lookup dictionaries, i.e. part of your pattern may be a reference to a dictionary which may contain thousands or even millions of entries. So PERSON_NAME can be a dictionary.
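To make the named-pattern idea concrete, here is a minimal sketch in plain Python. It is not RL3 syntax or the RL3 API; it only mimics the idea with the standard `re` module. The `PATTERNS` library, the `expand` and `compile_named` helpers, and the toy name and title lists are all hypothetical.

```python
import re

# Hypothetical pattern library: names and regexes are illustrative only.
# Entries may refer to other entries by writing {NAME}.
PATTERNS = {
    "FIRST_NAME": r"(?:John|Mary|Alice)",
    "LAST_NAME": r"(?:Smith|Jones|Brown)",
    "PERSON_NAME": r"{FIRST_NAME}\s{LAST_NAME}",
    "JOB_TITLE": r"(?:CEO|CTO|Chief\s\w+\sOfficer)",
}

# Only upper-case identifiers in braces count as references, so ordinary
# regex quantifiers such as {2,3} are left alone.
_REF = re.compile(r"\{([A-Z_]+)\}")


def expand(pattern, library, _seen=()):
    """Recursively replace {NAME} references with the named sub-pattern."""
    def substitute(match):
        name = match.group(1)
        if name in _seen:
            raise ValueError("circular pattern reference: " + name)
        inner = expand(library[name], library, _seen + (name,))
        # Wrap in a non-capturing group so surrounding operators apply
        # to the whole sub-pattern.
        return "(?:" + inner + ")"
    return _REF.sub(substitute, pattern)


def compile_named(pattern, library=PATTERNS):
    """Expand all references, then compile an ordinary regex."""
    return re.compile(expand(pattern, library))


rule = compile_named(r"{PERSON_NAME}[,:\s]?{JOB_TITLE}")
m = rule.search("Speaker: Mary Jones CTO, Example Ltd.")
if m:
    print(m.group(0))
```

The top-level rule stays short and readable while the complicated parts live in the library, which is the split between "core" and "derived" patterns described above. Note that `[,:\s]?` allows only a single separator character, so "Mary Jones, CTO" (comma plus space) would not match this particular rule.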
https://gate.ac.uk is one of the most extensive and widely used. Note that rule-based techniques are generally not as effective as machine-learning-based techniques, though. GATE actually supports both ML and rules.
I went to a talk by Edward Snowden's boss the week before Edward went rogue and got an introduction to the kind of text analysis system which is used by people who work for three-letter agencies in places like Maryland and Virginia.
It was a product from BBN that looked a lot like RL3 but was much more complex and able to handle a set of 50,000 or more rules, on the assumption that 50 people are going to write 1,000 rules in a week, as opposed to 1 expert writing 50 a day and taking 3 years to write that many rules.
This is an "expert system" which has to perform well in some particular domain, so it might be analyzing police reports which have a certain structure, setting and cast of characters (or medical reports, or corporate filings, or what airline customers said about their flight).
It is manageable to make a rule-based system that performs very well on specific extraction tasks, and even to prove that it works up to a certain point with statistical bounds. The metrics are different, but I believe those systems would be laughed out of court if they couldn't be tuned up to 0.95 or better precision in some cases.
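One way to read "statistical bounds" here, as a sketch only: hand-check a random sample of the system's extractions and compute a lower confidence bound on precision. The Wilson score interval used below is my choice for illustration, not something the systems mentioned above are known to use, and the sample counts are made up.

```python
import math


def wilson_lower_bound(correct, total, z=1.96):
    """Lower end of the Wilson score interval for a proportion.

    correct / total is the observed precision on a hand-checked sample;
    z = 1.96 corresponds to roughly 95% confidence.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - margin


# Made-up review results: 96 of 100 correct vs. 980 of 1000 correct.
small = wilson_lower_bound(96, 100)
large = wilson_lower_bound(980, 1000)
print(round(small, 3), round(large, 3))
```

With only 100 reviewed extractions, an observed 0.96 is not enough to claim 0.95 precision at this confidence level, while 980 correct out of 1000 clears the bar. So the size of the audited sample matters as much as the observed rate.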
It's not cheap to write 50,000 rules, but making training sets for machine learning is a lot of work too. I sometimes get good classifiers (better than Mechanical Turk workers) with 2000-5000 samples, and some friends of mine in South America would expect you to provide about 20,000 samples for a simple data extraction pipeline.
Unfortunately I can't point you to references in the academic literature because the data sets are proprietary. NLP engineers look at metrics all day that they can't show you, and that is too bad, because you get a distorted picture of machine learning if you are doing Yann LeCun's digits: that is a carefully built 'toy' problem which has had the obstacles cleared away for you and is just big enough that the result looks miraculous.
This was actually one of the reasons we started RL3 -- to make rules development less expensive. Note, we started it more than 10 years ago as a purely internal product, and just several months ago we decided to make it public (currently it is free for personal, research and educational use). In the end, we were able to enable a team of linguists (i.e. we actually searched for linguists, not even computational linguists) to write and support a large library of NER patterns. Partially, this was possible due to named patterns -- i.e. a small team of expensive resources (programmers and pro comp. linguists) developed a sort of core library of patterns, and a larger library of patterns was built on top of this core.
Sorry, we are working on a website - the project is quite mature (more than 10 years), but it was mostly for internal use. We just recently decided to make it public, which is not always easy to do. So we have a lot to do with documentation, etc.
Regarding your question -- there is no limitation on the number of rules. But I should say the definition of "number of rules" may be tricky. I don't see any reason to have a lot of high-level rules (i.e. annotators and asserts): in most cases the number of annotators will be N times bigger than the number of entity classes, or categories (in the case of a categorization task). On the other hand, there may be many more low-level rules (i.e. patterns and predicates used to form the high-level rules).
Also, the built-in DAWG dictionaries may help a lot. These dictionaries can handle millions of entries, behave the same as other matchers (i.e. they can be used in the same way as other patterns and regex matchers), and work way faster than patterns. For instance, in RL3 you can define the pattern \<{PERSON_FIRSTNAME},?\s{PERSON_LASTNAME}\> (which matches a first name followed by an optional comma, a space and a last name), where {PERSON_FIRSTNAME} and {PERSON_LASTNAME} are dictionaries.
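As a rough illustration of a dictionary acting as a matcher inside a pattern, here is a plain-Python sketch. It is not how RL3 implements it: ordinary `frozenset` lookups stand in for the DAWG, and the tiny word lists, `CANDIDATE` regex and `find_person_names` helper are hypothetical.

```python
import re

# Hypothetical toy dictionaries; a real system would load far larger lists.
FIRST_NAMES = frozenset({"john", "mary", "alice"})
LAST_NAMES = frozenset({"smith", "jones", "brown"})

# Structural part of the pattern: word, optional comma, whitespace, word.
# The two words are then validated against the dictionaries.
CANDIDATE = re.compile(r"\b([A-Za-z]+),?\s([A-Za-z]+)\b")


def find_person_names(text):
    """Return spans where a dictionary first name is followed by a last name."""
    results = []
    pos = 0
    while True:
        m = CANDIDATE.search(text, pos)
        if m is None:
            break
        first, last = m.group(1).lower(), m.group(2).lower()
        if first in FIRST_NAMES and last in LAST_NAMES:
            results.append(m.group(0))
            pos = m.end()
        else:
            # Retry from the second word so overlapping candidates are seen.
            pos = m.start(2)
    return results


names = find_person_names("Yesterday Mary Jones met Smith, John Brown and Alice.")
print(names)
```

The regex only describes the shape of the match; membership checks do the rest. For millions of entries a compact structure such as a DAWG keeps memory use down, which a plain set of strings does not.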
This is a lot like instaparse, which I think has a much better API, but it is on the JVM and may not perform as well. I haven't used it to perform something like a regex capture, but it may not be hard to do so.