| I'd love to hear Chomsky's reaction to this stuff (or someone in his camp on the Chomsky vs. Norvig debate [0]). My understanding is that Chomsky was against statistical approaches to AI, as being scientifically un-useful - eventual dead ends, which would reach a certain accuracy, and plateau - as opposed to the purer logic/grammar approaches, which reductionistically/generatively decompose things into constituent parts, in some interpretable way, which is hence more scientifically valuable, and composable - easier to build on. But now we're seeing these very successful blended approaches, where you've got a grammatical search, which is reductionist, and produces an interpretable factoring of the sentence - but its guided by a massive (comparatively uninterpretable) neural net. It's like AlphaGo - which is still doing search, in a very structured, rule based, reductionist way - but leveraging the more black-box statistical neural network to make the search actually efficient, and qualitatively more useful. Is this an emerging paradigm? I used to have a lot of sympathy for the Compsky argument, and thought Norvig et al. [the machine learning community] could be accused of talking up a more prosaic 'applied ML' agenda into being more scientifically worthwhile than it actually was. But I think systems like this are evidence that gradual, incremental, improvement of working statistical systems, can eventually yield more powerful reductionist/logical systems overall.
I'd love to hear an opposing perspective from someone in the Chomsky camp, in the context of systems like this.
(Which I am hopefully not strawmanning here.) [0]Norvig's article: http://norvig.com/chomsky.html |
See e.g. http://visl.sdu.dk/~eckhard/pdf/TIL2006.pdf which gets 99% on POS and 96% on syntax function assignment – Constraint Grammar parsers are the state of the art of rule-based systems, and the well-developed ones beat statistical systems. CG's are also multitaggers – they don't assume a word has to have only one reading, it might actually be ambiguous, and in that case it shouldn't be further disambiguated (that's why they use F-scores instead of plain "accuracy").
CG's also require manual work, so it's not like you can download a corpus an unsupervisedly learn everything; but on the other hand, for what languages in the world do you have a large enough data set to unsupervisedly learn a good model? And for what training methods can you even get good models from unlabeled data? The set of languages for which there are large annotated corpora (especially treebanks) is even smaller … So CG's are also heavily used for lesser-resourced languages (typically in combination with finite state transducers for morphological analysis), where the lack of training data means it's a lot more cost-effective to write rules (and turn existing dictionaries into machine-readable FST's) than it is to create annotated training data (which would often involve OCR-ing texts, introducing yet another error source). CG writers still tend to have a very empirical mindset – no toy sentences like "put the cone on the block", but continual testing on any real-world text they can get their hands on.