| "What should or shouldn’t be a wh-island" is literally a statement of "what words might come after some other words"! An LLM encodes billions of such statements, just unfortunately in a quantity and form that makes them incomprehensible to an unaided human. That part is strictly worse; but the LLM's statements model language well enough to generate it, and that part is strictly better. As I read Norvig's essay, it's about that tradeoff, of whether a simple and comprehensible but inaccurate model shows more promise than a model that's incomprehensible except in statistical terms with the aid of a computer, but far more accurate. I understand there's a large group of people who think Norvig is wrong or incoherent; but when those people have no accomplishments except within the framework they themselves have constructed, what am I supposed to think? Beyond that, if I have a model that tells me whether a sentence is valid, then I can always try different words until I find one that makes it valid. Any sufficiently good model is thus capable of generation. Chomsky never proposed anything capable of that; but that just means his models were bad, not that he was working on a different task. As to the relationship between signals from biological neurons and ANN activations, I mean something like the paper linked below, whose authors write: > Thus, even though the goal of contemporary AI is to improve model performance and not necessarily to build models of brain processing, this endeavor appears to be rapidly converging on architectures that might capture key aspects of language processing in the human mind and brain. https://www.biorxiv.org/content/10.1101/2020.06.26.174482v3.... I emphasize again that I believe these results have been oversold in the popular press, but the idea that an ANN trained on brain output (including written language) might provide insight into the physical, causal structure of the brain is pretty mainstream now. |
This gets at the nub of the misunderstanding. Chomsky is interested in modeling the range of grammatical structures and associated interpretations possible in natural languages. The wh-island condition is a universal structural constraint that only indirectly (and only sometimes) has implications for which sequences of words are ‘valid’ in a particular language.
LLMs make no prediction at all as to whether or not natural languages should have wh-islands: they’ll happily learn languages with or without such constraints.
If you want a more concrete example of why wh-islands can’t be understood in terms of permissible or impermissible sequences of words, consider cases like:
How often did you ask why John took out the trash?
The wh-island created by ‘why’ rules out one of the interpretations that would otherwise be possible (the embedded-question reading, where ‘how often’ modifies ‘took’), but the sequence of words itself is fine.
> Chomsky never proposed anything capable of that; but that just means his models were bad, not that he was working on a different task.
No, Chomsky really was working on a different task: a solution to the logical problem of language acquisition and a theory of the range of possible grammatical variation across human languages. There is no reason to think that a perfect theory in this domain would be of any particular help in generating plausible-looking text. From a cognitive point of view, text generation rather obviously involves the contribution of many non-linguistic cognitive systems which are not modeled (nor intended to be modeled) by a generative grammar.
> the paper linked below
This paper doesn’t make any claims that are obviously incompatible with anything that Chomsky has said. The fundamental finding is unsurprising: brains are sensitive to surprisal. The better your language model is at modeling whether or not a sequence of words is likely, the better you can predict the brain’s surprisal reactions. There are no implications for cognitive architecture. This ought to be clear from the fact that a number of different neural net architectures are able to achieve a good degree of success, by the paper’s own lights.
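To make "surprisal" concrete: it is just the negative log probability of a word given its context, so any model that assigns probabilities to word sequences yields a surprisal value. Here is a minimal sketch using a toy bigram model with add-one smoothing. The corpus, the function names, and the choice of a bigram model are all my own assumptions for illustration; this is not the method used in the paper.

```python
import math
from collections import Counter

def train_bigram(corpus):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        # "<s>" is a hypothetical start-of-sentence marker.
        tokens = ["<s>"] + sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def surprisal(prev, word, unigrams, bigrams):
    """Surprisal in bits: -log2 P(word | prev), with add-one smoothing."""
    vocab_size = len(unigrams)
    prob = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
    return -math.log2(prob)

# Toy corpus, invented purely for illustration.
corpus = [
    "john took out the trash",
    "john took out the dog",
    "mary took out the trash",
]
unigrams, bigrams = train_bigram(corpus)

# A continuation seen in the corpus vs. one never seen after "the".
expected = surprisal("the", "trash", unigrams, bigrams)
unexpected = surprisal("the", "john", unigrams, bigrams)
print(f"the -> trash: {expected:.2f} bits")
print(f"the -> john:  {unexpected:.2f} bits")
```

The unseen continuation gets the higher surprisal. Nothing in this calculation depends on how the probabilities were produced, which is the point: any model that estimates them well will track surprisal well, whatever its internal architecture.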