
> That conflates perception with perceiver.

I'm not sure I understand. Can you elaborate?

> From a model of the mind pov, the 'self' that we sense has an internal LLM-like tool. And it is that self that understands and not the tool.

I'm starting to think it's the other way around. I think it's somewhat widely accepted that our brains do most of the "thinking" and "understanding" unconsciously - our conscious self is more of an observer / moderator, occasionally hand-holding the thought process when the topic of interest is hard, and one isn't yet proficient[0] in it.

Keeping that in mind, if you - like me - feel that LLMs are best compared to our "inner voice", i.e. the bit on the boundary between conscious and unconscious that uses language as an interface to the former, then it's not unreasonable to expect that LLMs may, in fact, understand things. Not emulate, but actually understand.

The whole deal with a hundred-thousand-dimensional latent space? I have a growing suspicion that this is exactly the fundamental principle behind how understanding, thinking in concepts, and thinking in general work for humans too. Sure, we have multiple senses feeding into our "thinking" bit, but that doesn't change much.

At a conceptual, handwavy level (I don't know the actual architecture and math details well enough to offer more concrete explanations/stories), I feel there are too many coincidences to ignore.

Is it a coincidence that someone trained an LLM and an image network, and found that their independently learned latent spaces map to each other with a simple transform? Maybe[1], but it also makes sense: both networks segmented data about the same view of reality that humans have. There is no reason for LLMs to have an entirely different way of representing "understanding" than img2txt or txt2img networks.
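To make "map to each other with a simple transform" a bit more concrete, here is a minimal sketch. It assumes "simple" means linear and uses made-up synthetic vectors in place of real image/text embeddings; the dimensions, the noise level, and the `relative_residual` helper are all hypothetical, not taken from the paper. It fits a least-squares linear map between two paired sets of vectors, then repeats the fit with the pairing shuffled. That shuffled fit is the kind of control footnote [1] worries about.

```python
import numpy as np

def relative_residual(X, Y):
    """Fit a least-squares linear map W (X @ W ~ Y) and return ||X @ W - Y|| / ||Y||."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return np.linalg.norm(X @ W - Y) / np.linalg.norm(Y)

rng = np.random.default_rng(0)
n, d_img, d_txt = 200, 8, 6  # hypothetical sizes, far smaller than real latent spaces

# Hypothetical paired embeddings: row i of X and row i of Y describe the same item.
X = rng.normal(size=(n, d_img))
true_W = rng.normal(size=(d_img, d_txt))
Y = X @ true_W + 0.01 * rng.normal(size=(n, d_txt))  # linearly related, plus small noise

# Control: break the pairing so any fit can only be chance.
Y_shuffled = Y[rng.permutation(n)]

print(f"paired:   {relative_residual(X, Y):.4f}")
print(f"shuffled: {relative_residual(X, Y_shuffled):.4f}")
```

A low residual on true pairs and a high one on shuffled pairs is what "the spaces line up" would look like. A real study would also score on held-out pairs rather than on the data used for the fit.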

Assuming that such a shared latent structure is real, is it a coincidence that it offers a decent explanation for how humans developed language? You start with an image/sound/touch/other-sense acquisition and association system forming a basic brain, predicting the next sensations and driving actions. As it evolves in size and complexity, the dimensionality of its representation space grows, and at some point the associations cluster into something like a world model. Let evolution iterate some more (a couple hundred thousand years), and you end up with brains that can build more complex world models, working with more complex associations (e.g. vibration -> sound -> tone -> grunt -> phrase/song). At this level, language seems like an obvious thing: it takes complex associations of basic sensory input and associates them wholesale with different areas of the latent space, so that e.g. one specific grunt now associates with danger, a different one with safety, etc. Once you have brains that can do that naturally, it's pretty much a straight line to proper language.

Yes, this probably comes across as a lot of hand-waving; I don't have the underlying insights properly sorted yet. But a core observation I want to communicate, and that I'd recommend people ponder, is continuity. This process gains capabilities in a continuous fashion as it scales, which is exactly the kind of system you'd expect evolution to lock on to.

[0] - What is "proficiency" anyway? To me, being proficient in a field of interest is mostly about... shifting the understanding of that field to the unconscious level as much as possible.

[1] - This was one paper I am aware of; they probably didn't use good enough controls, so it might turn out to be happenstance.