by foooorsyth 941 days ago
The training space is more important. I don’t think a general intelligence will spawn from text corpuses. A person only able to consume text to learn would be considered severely disabled.

A significant part of intelligence comes from existence in meatspace and the ability to manipulate and observe that meatspace. A two year old learns much faster with much less data than any LLM.

2 comments

We already have multimodal models that take both images and text as input. The bulk of the training for these models was in text, not images. This shouldn’t be surprising. Text is a great way of abstractly and efficiently representing reality. Of course those patterns are useful for making sense of other modalities.

Beyond modeling the world, text is also a great way to model human thought and reason. People like to explain their thought process in writing. LLMs already pick up on and mimic chain of thought well.

Contained within large datasets is crystallized thought, and efficient descriptions of reality that have proven useful for processing modalities beyond text. To me that seems like a great foundation for AGI.

> To me that seems like a great foundation for AGI.

It's only one part. Predicting text is relatively straightforward because it doesn't require predicting arbitrary sequences like 'a S23mz s.zawsds'. Statistically, humans use only a limited number of word combinations, so with hundreds of billions of parameters, significant compression is possible. Mathematics is different: it requires actual reasoning, an area where LLMs often struggle because they lack the capability for genuine reasoning.

Text and 2D images are a tiny subset of physical reality as perceived by an able-bodied human. Even our best approximation (3D VR headset with Spatial Audio) is a poor representation. We don’t even bother to simulate touch, temperature, equilibrioception (the sense of balance), etc. And the more detailed you get, the less data you have.

These senses can be described via text, but I’m highly skeptical that the learning outcomes will be the same.

>> Text and 2D images are a tiny subset of physical reality as perceived by an able-bodied human. Even our best approximation is a poor representation.

This is wrong. There’s nothing magical about human perception. You see the world because a 2D image is projected onto your retina.

GPT-4 was trained on text and generalized the ability to output 2D images. There’s absolutely nothing to suggest text can’t generalize further to new modalities. GPT-4 is forced to serialize images as SVGs to output them (a crazy emergent ability btw), but that demonstrates an inherent spatial reasoning capability baked into the model.

GPT-4V was created with a transfer learning step where image embeddings are passed as input in place of text. That’s further evidence of models’ ability to generalize to new modalities.

Everything you need to do multimodal input and output is already trained in; I’m sure GPT-4V is just the start.

>GPT-4 was trained on text

And it shows. It has a poor grasp of reality. It does a poor job with complex tasks. It cannot be trusted with specialized tasks typically done by expert humans. It is certainly an amazing technical achievement that does a decent job with simple tasks requiring cursory knowledge, but that’s all it is at this time.

>There’s absolutely nothing to suggest text can’t generalize further to new modalities

Inversion of burden of proof.

>> Inversion of burden of proof

Nope. OpenAI has already demonstrated the ability to generalize GPT-4 to a new modality. Your claim that text models can only generalize to images and not other modalities is utterly unconvincing. Explain to me why vision is so much different from, say, audio?

>> And it shows. It has a poor grasp of reality. It does a poor job with complex tasks.

GPT-4 is a proof of concept more than anything. I’m excited to see how much reliability improves over time. Its grasp of reality isn’t perfect, but at least it understands how burden of proof works.

>GPT-4 is a proof of concept more than anything

Hilarious walk-back. “Text can generalize anything” -> “It’s just a demo, bro” in the same post.

Lmao

A two year old learns faster because it has inherited training data from its ancestors in the form of evolutionary memory. Think of it as a BIOS for human beings. The LLM takes longer to learn because we are building this BIOS for it. Remember it took billions of years for the human BIOS to be developed.