| HN Mirror


by PaulHoule 2820 days ago

Funny.

I went to a talk by Edward Snowden's boss the week before Edward went rogue and got an introduction to the kind of text analysis system used by people who work for three-letter agencies in places like Maryland and Virginia.

It was a product from BBN that looked a lot like RL3 but was much more complex and able to handle a set of 50,000 or more rules, on the assumption that 50 people are going to write 1,000 rules each in a week, as opposed to 1 expert who writes 50 a day and takes 3 years to write that many rules.

This is an "expert system" which has to perform well in some particular domain, so it might be analyzing police reports, which have a certain structure, setting and cast of characters (or medical reports, or corporate filings, or what airline customers said about their flight).

It is manageable to make a rule-based system that performs very well on specific extraction tasks, and even to prove that it works up to a certain point with statistical bounds. The metrics are different, but I believe those systems would be laughed out of court if they couldn't be tuned up to 0.95 or better precision in some cases.

It's not cheap to write 50,000 rules, but making training sets for machine learning is a lot of work too. I sometimes get good classifiers (better than Mechanical Turk workers) with 2,000-5,000 samples, and some friends of mine in South America would expect you to provide about 20,000 samples for a simple data extraction pipeline.

Unfortunately I can't point you to references in the academic literature because the data sets are proprietary. NLP engineers look at metrics all day that they can't show you, and that is too bad, because you get a distorted picture of machine learning if you are doing Yann LeCun's digits. That is a carefully built 'toy' problem which has had obstacles cleared away for you and is just big enough that the result looks miraculous.


jo_kruger 2820 days ago

This was actually one of the reasons we started RL3 -- to make rule development less expensive. Note, we started it more than 10 years ago as a purely internal product, and just several months ago we decided to make it public (currently it is free for personal, research and educational use). In the end, we were able to enable a team of linguists (i.e. we actually searched for linguists, not even computational linguists) to write and support a large library of NER patterns. This was possible partly thanks to named patterns -- i.e. a small team of expensive resources (programmers and pro comp. linguists) developed a sort of core library of patterns, and a larger library of patterns was built on top of this core...


yazaddaruvala 2819 days ago

> like RL3 but much more complex and able to handle a set of 50,000 or more rules

jo_kruger, the website is relatively lax on the limitations of RL3. For example, I'm curious what the max number of rules is that it can generally handle.


jo_kruger 2819 days ago

Sorry, we are working on the website - the project is quite mature (more than 10 years), but it was mostly for internal use. We just recently decided to make it public, which is not always easy to do. So we still have a lot to do on documentation, etc.

Regarding your question -- there is no limitation on the number of rules. But I should say the definition of "number of rules" may be tricky. I don't see any reason to have a lot of high-level rules -- i.e. annotators and asserts -- in most cases the number of annotators will be N times bigger than the number of entity classes, or categories (in the case of a categorization task). On the other hand, there may be many more low-level rules (i.e. patterns and predicates used to form the high-level rules).

Also, the built-in DAWG dictionaries may help a lot - these dictionaries can handle millions of entries, behave the same as other matchers (i.e. they can be used in the same way as other patterns and regex matchers), and work way faster than patterns. For instance, in RL3 you can define the pattern \<{PERSON_FIRSTNAME},?\s{PERSON_LASTNAME}\> (which matches a first name followed by an optional comma, followed by a space and a last name) where {PERSON_FIRSTNAME} and {PERSON_LASTNAME} are dictionaries.
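For readers without RL3 at hand, a rough Python analogue of that pattern might look like the sketch below. This is not RL3 syntax or its API: the two dictionaries are tiny hypothetical sets, the helper names are made up, and a plain regex alternation stands in for the DAWG, so it would not scale to millions of entries the way the built-in dictionaries are described as doing.

```python
import re

# Hypothetical toy dictionaries standing in for {PERSON_FIRSTNAME} and
# {PERSON_LASTNAME}; real ones would be far larger.
PERSON_FIRSTNAME = {"Edward", "Yann", "Paul"}
PERSON_LASTNAME = {"Snowden", "LeCun", "Houle"}

def dictionary_alternation(entries):
    """Turn a set of entries into a non-capturing regex alternation."""
    # Longest entries first, so a longer name wins over one of its prefixes.
    ordered = sorted(entries, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(entry) for entry in ordered) + ")"

# first name, optional comma, one whitespace character, last name,
# anchored on word boundaries
PERSON_NAME = re.compile(
    r"\b(" + dictionary_alternation(PERSON_FIRSTNAME) + r"),?\s("
    + dictionary_alternation(PERSON_LASTNAME) + r")\b"
)

def find_person_names(text):
    """Return a list of (first_name, last_name) tuples found in text."""
    return PERSON_NAME.findall(text)
```

Calling find_person_names on a sentence that mentions "Edward Snowden's boss" and "Yann, LeCun" returns both name pairs, while a lone "Paul" with no dictionary last name after it is ignored.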


PaulHoule 2819 days ago

50,000 rules is not that hard in 2018 if you have a modern RETE engine and the right indexes for lookups.


jo_kruger 2819 days ago

I used RL3 to annotate words from multiple language dictionaries with a total of ~25 million known words (i.e. 25M dictionary entries) - no problem at all.
