| Funny. I went to a talk by Edward Snowden's boss the week before Edward went rogue and got an introduction to the kind of text analysis system used by people who work for three-letter agencies in places like Maryland and Virginia. It was a product from BBN that looked a lot like RL3 but was much more complex and able to handle a set of 50,000 or more rules, on the assumption that 50 people are going to write 1,000 rules in a week, as opposed to one expert writing 50 a day and taking 3 years to write that many. This is an "expert system" that has to perform well in some particular domain, so it might be analyzing police reports, which have a certain structure, setting and cast of characters (or medical reports, or corporate filings, or what airline customers said about their flight). It is manageable to build a rule-based system that performs very well on specific extraction tasks and even to prove, with statistical bounds, that it works up to a certain point. The metrics are different, but I believe those systems would be laughed out of court if they couldn't be tuned up to 0.95 or better precision in some cases. It's not cheap to write 50,000 rules, but making training sets for machine learning is a lot of work too. I sometimes get good classifiers (better than Mechanical Turk workers) with 2,000-5,000 samples, and some friends of mine in South America would expect you to provide about 20,000 samples for a simple data extraction pipeline. Unfortunately I can't point you to references in the academic literature because the data sets are proprietary. NLP engineers look at metrics all day that they can't show you, and that's too bad, because you get a distorted picture of machine learning if you are doing Yann LeCun's digits: that is a carefully built 'toy' problem that has had the obstacles cleared away for you and is just big enough that the result looks miraculous. |
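To make the "statistical bounds" point concrete, here is a rough sketch of how one might bound precision from a hand-checked sample of extractions. It uses the Wilson score interval, which is my choice for illustration and not necessarily what any of those systems use. The function name and the sample counts are made up.

```python
import math


def wilson_lower_bound(correct: int, total: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a proportion.

    `correct` is the number of hand-checked extractions that were right,
    `total` is the number checked, and z=1.96 gives a ~95% two-sided interval.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    p_hat = correct / total
    z2 = z * z
    denom = 1.0 + z2 / total
    center = (p_hat + z2 / (2.0 * total)) / denom
    half_width = z * math.sqrt(
        p_hat * (1.0 - p_hat) / total + z2 / (4.0 * total * total)
    ) / denom
    return center - half_width


# Hypothetical audit: 1,000 extractions checked by hand, 970 of them correct.
bound = wilson_lower_bound(970, 1000)
print(f"observed precision 0.970, lower bound {bound:.4f}")
```

The point is that an observed precision only slightly above 0.95 on a sample this size does not let you claim 0.95 with confidence; you need either a higher observed rate or a bigger audited sample.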