
by unhammer 3692 days ago

I don't think you're strawmanning in general; there have been a lot of symbolic AI people who scoffed at any mention of statistics or real-world data. But it's not the case that you have to eschew all empiricism just because you use rules.

See e.g. http://visl.sdu.dk/~eckhard/pdf/TIL2006.pdf, which gets 99% on POS and 96% on syntactic function assignment. Constraint Grammar (CG) parsers are the state of the art among rule-based systems, and the well-developed ones beat statistical systems. CGs are also multitaggers: they don't assume a word has to have only one reading. A word might genuinely be ambiguous, and in that case it shouldn't be disambiguated any further (that's why they report F-scores instead of plain "accuracy").
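To make the F-score point concrete, here is a minimal sketch of how a multitagger could be scored when each token may keep more than one reading. The tag names and the tiny gold/predicted sets are made up for illustration, and this is not the evaluation script used in the linked paper.

```python
def precision_recall_f1(gold, predicted):
    """Score multitagger output.

    gold, predicted: lists of sets of tags, one set per token.
    A gold set has more than one tag when the word is truly ambiguous.
    """
    assert len(gold) == len(predicted)
    kept = sum(len(p) for p in predicted)                   # readings the tagger left in
    wanted = sum(len(g) for g in gold)                      # readings that should be there
    correct = sum(len(g & p) for g, p in zip(gold, predicted))
    precision = correct / kept if kept else 0.0
    recall = correct / wanted if wanted else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical data: the last token is genuinely ambiguous in the gold standard,
# and the tagger left one spurious reading on the second token.
gold = [{"DET"}, {"N"}, {"V"}, {"N", "V"}]
predicted = [{"DET"}, {"N", "V"}, {"V"}, {"N", "V"}]

p, r, f1 = precision_recall_f1(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

With one tag per token, precision and recall collapse into the same number, which is why plain "accuracy" is enough for single-taggers but not here.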

CGs also require manual work, so it's not like you can download a corpus and unsupervisedly learn everything. On the other hand, for which of the world's languages do you have a large enough data set to learn a good model without supervision? And which training methods even give you good models from unlabeled data? The set of languages for which there are large annotated corpora (especially treebanks) is even smaller … So CGs are also heavily used for lesser-resourced languages (typically in combination with finite-state transducers for morphological analysis), where the lack of training data makes it a lot more cost-effective to write rules (and turn existing dictionaries into machine-readable FSTs) than to create annotated training data (which would often involve OCR-ing texts, introducing yet another error source). CG writers still tend to have a very empirical mindset: no toy sentences like "put the cone on the block", but continual testing on any real-world text they can get their hands on.
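For readers who haven't seen the analyser-plus-rules pipeline, a rough sketch of the idea in Python follows. Everything in it is an assumption for illustration: the dictionary stands in for an FST morphological analyser, the tags are invented, and the rule function only imitates the spirit of a CG REMOVE rule rather than real CG syntax. It is deliberately a toy; real grammars are tested on running text.

```python
# Hypothetical lexicon standing in for an FST morphological analyser:
# each word form maps to all of its possible readings.
LEXICON = {
    "the": {"DET"},
    "can": {"N", "V", "AUX"},
    "rusts": {"N", "V"},
}


def analyse(tokens):
    """Return one set of candidate readings per token."""
    return [set(LEXICON.get(t.lower(), {"UNK"})) for t in tokens]


def remove_if_prev_is(cohorts, target, prev_tag):
    """REMOVE-style rule: drop `target` when the previous token is
    unambiguously `prev_tag`. The last remaining reading is never removed."""
    for i in range(1, len(cohorts)):
        if (
            cohorts[i - 1] == {prev_tag}
            and target in cohorts[i]
            and len(cohorts[i]) > 1
        ):
            cohorts[i].discard(target)
    return cohorts


cohorts = analyse(["The", "can", "rusts"])
# Two hand-written constraints: no verb or auxiliary right after a determiner.
for tag in ("V", "AUX"):
    remove_if_prev_is(cohorts, tag, "DET")

print(cohorts)
```

Note that "rusts" keeps both readings: no rule had safe enough context to discard one, so the ambiguity is left in the output instead of being guessed away.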