Sorry for the late response. Yes, that is Hinton's argument, and the claim made by the believers. On the other hand, if the GAC explanation is correct, it may be that what we humans write down (that is, the training corpus) is a model of the world, and LLMs reconstruct (descriptions of) human understanding.
Now of course, the only input LLMs have is human text (for text-only LLMs, anyway). So their model is entirely dependent on how we see the world.
I wouldn't restrict LLMs to descriptions of human understanding. They can articulate concepts in a rather sensible way that wouldn't exist as is in the training corpus. Which means exactly that they have a model, however limited or imperfect.
"they can articulate concepts... that [don't exist] in the training corpus" Yes, but that doesn't necessarily mean they have a model [of the world]. You might want to say they are articulating the plausible (that is, something that fits with our model of the world), but I think they are producing plausible articulations that we interpret against our model.
When you ask an LLM a question about cars, it needs an inner representation of what a car is (however imperfect it may be) to answer your question. A model of "language", as you want to define it, would output a grammatically correct wall of text that goes nowhere.
A map of how concepts relate in language is not a model of the world, except in the extremely limited sense that languages are part of the world.
And yeah, that wasn't clear before people created those machines that can speak but can't think. But it should be completely obvious to anybody who interacts with them for a short while.
"How concepts relate" is called a model. That it uses language to be interacted with is irrelevant to the fact that it's a model of a worldly concept.
What of multimodal models, according to you? Are they "models of eyesight", "models of sound", or of pixels or wavelengths... C'mon.