by jo_kruger 2815 days ago
This was actually one of the reasons we started RL3: to make rule development less expensive. Note that we started it more than 10 years ago as a purely internal product, and only several months ago did we decide to make it public (it is currently free for personal, research, and educational use). In the end, we were able to enable a team of linguists (we actually searched for linguists, not even computational linguists) to write and support a large library of NER patterns. This was possible partly thanks to named patterns: a small team of expensive resources (programmers and professional computational linguists) developed a core library of patterns, and a larger library of patterns was then built on this core...

> like RL3 but much more complex and able to handle a set of 50,000 or more rules

jo_kruger, the website is relatively light on the limitations of RL3. For example, I'm curious: what is the maximum number of rules it can generally handle?

Sorry, we are still working on the website. The project is quite mature (more than 10 years old), but it was mostly for internal use. We only recently decided to make it public, which is not always easy to do, so we still have a lot to do on documentation, etc.

Regarding your question: there is no limit on the number of rules. But I should say that the definition of "number of rules" may be tricky. I don't see any reason to have a lot of high-level rules (i.e. annotators and asserts); in most cases the number of annotators will be N times larger than the number of entity classes, or categories in the case of a categorization task. On the other hand, there may be many more low-level rules (i.e. the patterns and predicates used to form the high-level rules).

Also, the built-in DAWG (directed acyclic word graph) dictionaries may help a lot. These dictionaries can handle millions of entries, behave the same as other matchers (i.e. they can be used in the same way as other patterns and regex matchers), and work much faster than patterns. For instance, in RL3 you can define the pattern \<{PERSON_FIRSTNAME},?\s{PERSON_LASTNAME}\> (which matches a first name followed by an optional comma, a space, and a last name), where {PERSON_FIRSTNAME} and {PERSON_LASTNAME} are dictionaries.
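
To make the idea of a dictionary acting as a matcher inside a pattern more concrete, here is a rough sketch in Python. This is not RL3 code and says nothing about how RL3 is implemented: plain Python sets stand in for the DAWG dictionaries, `\b` stands in for `\<` and `\>`, and the names `PERSON_FIRSTNAME`, `PERSON_LASTNAME`, `CANDIDATE` and `find_person_names` as well as the sample entries are made up for illustration.

```python
import re

# Hypothetical stand-ins for the {PERSON_FIRSTNAME} and {PERSON_LASTNAME}
# dictionaries. A real DAWG would hold millions of entries; a set is used
# here only to keep the sketch self-contained.
PERSON_FIRSTNAME = {"john", "maria", "wei"}
PERSON_LASTNAME = {"smith", "garcia", "chen"}

# Two words separated by an optional comma and exactly one whitespace
# character, anchored on word boundaries.
CANDIDATE = re.compile(r"\b(\w+),?\s(\w+)\b")


def find_person_names(text):
    """Return substrings that look like '<first name>[,] <last name>'."""
    matches = []
    pos = 0
    while True:
        m = CANDIDATE.search(text, pos)
        if m is None:
            break
        first, last = m.group(1).lower(), m.group(2).lower()
        if first in PERSON_FIRSTNAME and last in PERSON_LASTNAME:
            matches.append(m.group(0))
            pos = m.end()
        else:
            # Retry from the second word so overlapping word pairs are
            # also considered.
            pos = m.start(2)
    return matches


if __name__ == "__main__":
    print(find_person_names("Yesterday John Smith met Maria, Garcia and Wei Chen."))
```

The regex only proposes candidate word pairs; the dictionary lookups decide whether a pair counts as a match, so growing the dictionaries does not change the pattern itself.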

50,000 rules is not that hard in 2018 if you have a modern RETE engine and the right indexes for lookups.
I used RL3 to annotate words from multiple language dictionaries, with a total of about 25 million known words (i.e. 25M dictionary entries), and had no problems at all.