|
You are mistaking the map for the territory. The TERRITORY of human experience is higher dimensional. The LLM utilizes a lower resolution mapping of that territory, a projection from experience to textual (or pixel, or waveform, etc.) representations. This is not just a lossy mapping; it excludes entire categories of experience that cannot be captured/encoded except for as a pointer to the real experience, one that is often shared by the embodied, embedded, enacted, and extended cognitive beings that have had that experience. I can point to beauty and you can understand me because you've experienced beauty. I cannot encode beauty itself. The LLM cannot experience beauty. It may be able to analyze patterns of things determined beautiful by beauty experiencers, but this is, again, a lower resolution map of the actual experience of beauty. Nobody had to train you to experience beauty—you possess that capability innately. You cannot encode the affective response one experiences when holding their newborn. You cannot encode the cognitive appraisal of a religious experience. You can't even encode the qualia of red except for, again, as a pointer to the color. You're also missing that 4E cognitive beings have a fundamental experience of consciousness—particularly the aspect of "here" and "now". The LLM cannot experience either of those phenomena. I cannot encode here and now. But you can, and do, experience both of those constantly. |
Once you grant that, the rest is bookkeeping. Any finite physical process has a finite physical trace. That trace is measurable to some precision. A finite trace can be serialized into a finite string of symbols. If you prefer bits, take a binary code. If you prefer integers, index the code words. The choice of alphabet does not matter. You can map a movie, a symphony, a spike train, a retina’s photon counts, or a full brain-body sensorium collected at some temporal resolution into a single long string. You lose nothing by serialization because the decoder knows the schema. This is not a “text only” claim. It is a claim about representation.
Your high dimensionality objection collapses under the same lens. High dimensional just means many coordinates. There is a well known result that any countable description can be put in one dimension by an invertible code. Think Gödel numbering or interleaving bits of coordinates. You do not preserve distances, but you do preserve information. If the thing you care about is the capacity to carry structure, the one dimensional string can carry all of it, and you can recover the original arrangement exactly given the decoding rule.
Now take the 4E point. Embodiment matters because it constrains the data distribution and the actions that follow. It does not create a magic type of information that cannot be encoded. A visual scene is photons on receptors over time. Proprioception is stretch receptor states. Affect is the joint state of particular neuromodulatory systems and network dynamics. Attention and working context are transient global variables implemented by assemblies. All of that can be logged, compressed, and restored to the degree your sensors and actuators allow. The fact that a bottle with a genome inside does not make a child on a beach tells you reproduction needs a decoder and an environment. It does not tell you the code fails to specify the organism. Likewise, an LLM plus a diffusion decoder can take a text latent and unfold it into an image distribution that matches world statistics because the bridge model plays the role of the environment for that domain.
“LLMs cannot experience beauty” simply reasserts the thing you want to prove. We have no privileged readout for human qualia either. We infer it from behavior, physiology, and report. We do not understand human brains at the level of complete causal microphysics because of scale and complexity, not because there is a non-physical remainder. We likewise do not fully understand why a large model makes a given judgment. Same reason. Scale and complexity. If you point to mystery on one side as a defect, you must admit it on the other.
The map versus territory line also misses the target. Of course a representation is not the thing itself. No one is claiming a jpeg is a sunset. The claim is that the structure necessary to act as if about sunsets can be encoded and learned. A system that takes in light fields, motor feedback, language, and reward and that updates an internal world model until its predictions and actions match ours to arbitrary precision will meet every operational test you have for meaning. If you reply that something is still missing, you have stepped outside evidence into stipulation.
So let’s keep the ground rules clear. Everything we are and feel is physically instantiated. Physical instantiations at finite precision admit lossless encodings as strings. Strings can be learned over by generic function approximators that optimize on pattern consistency, regardless of whether the symbols came from pixels, pressure sensors, or phonemes. That makes the “text inside, image outside” complaint irrelevant. The substrate is a detail. The constraint is data and objective.
We cannot yet build a full decoder for the human condition. That is a statement about engineering difficulty, not impossibility. And it cuts both ways. We do not know how to fully read a person either. But we do not conclude that people lack experience. We conclude that we lack understanding.