Hacker News
by ilaksh 2428 days ago
The AIs in the benchmark are all trained exclusively on text, correct?

My assumption has always been that to get human-level understanding, the AI systems need to be trained on things like visual data in addition to text. This is because there is a fair amount of information that is not encoded at all in text, or at least is not described in enough detail.

I mean, humans can't learn to understand language properly without using their other senses. You need something visual or auditory to associate with the words, which are really supposed to represent full systems that are complex and detailed.

I think this would be much more obvious if there were questions that involved things like spatial reasoning, or that combined image recognition with comprehension.

3 comments

Mmm. The philosophical position that it's essential to be embodied in order to have intelligence seems intuitively reasonable but is very much unproven. You will find philosophers and cognitive scientists who are sure you're right, but they don't have much hard evidence, and you will also find people like me who are pretty sure you're wrong but likewise have no hard evidence.

Specifically, remember that deaf-blind people exist, so if you're sure that you "need something visual or auditory" then those people are not, according to your beliefs, able to understand language. I think they'll disagree with you quite strongly.

> remember that deaf-blind people exist [... ...] able to understand language

I got curious if/how deafblind people learn to communicate in the first place, if they are completely deafblind from birth. If humans can learn not just communication but language without either vision or hearing, that seems to suggest either extreme adaptability or language learning being quite decoupled from vision and hearing. From an evolutionary standpoint, I imagine that both deafness and blindness are probably uncommon enough that language learning could have explicit dependencies on both hearing and vision.

I found an old-looking video about communication with deafblind people. At the linked timestamp is a woman who has been deafblind since age 2.

https://youtu.be/usaf3bVVvjY?t=840

I think maybe the CLEVR[0] dataset is what you are talking about?

Keep in mind that most of the current ML systems have diverged from biology. A majority of the recent breakthroughs come from mathematics; the rationale is that just because the human brain does it in a certain way does not necessarily mean it is the only way to do it.

[0] https://cs.stanford.edu/people/jcjohns/clevr/

In no way do I think that AGI needs to mimic animal/human intelligence.

I was just trying to explain why text input alone isn't going to be adequate and that was an example.

Thanks for the link. That is one example of the type of thing I was talking about, I think.

It's not just grounding the language in vision, but the embodiment, first person perspective and ability to interact with the environment. Humans have had the benefit of slowly evolving in a complex environment which is too expensive to recreate for artificial agents. We can only create very limited sims vs the real world.