| HN Mirror

by valine 941 days ago

We already have multimodal models that take both images and text as input. The bulk of the training for these models was in text, not images. This shouldn’t be surprising. Text is a great way of abstractly and efficiently representing reality. Of course those patterns are useful for making sense of other modalities.

Beyond modeling the world, text is also a great way to model human thought and reason. People like to explain their thought process in writing. LLMs already pick up on and mimic chain of thought well.

Contained within large datasets are crystallized thought and efficient descriptions of reality that have proven useful for processing modalities beyond text. To me that seems like a great foundation for AGI.

2 comments

lossolo 940 days ago

> To me that seems like a great foundation for AGI.

It's only one part. Predicting text is relatively straightforward because it doesn't require predicting complex sequences like 'a S23mz s.zawsds'. Statistically, humans use only a limited number of word combinations, so with hundreds of billions of parameters, significant compression is possible. Mathematics is different because it requires actual reasoning, an area where LLMs often struggle significantly because they lack the capability for genuine reasoning.

foooorsyth 940 days ago

Text and 2D images are a tiny subset of physical reality as perceived by an able-bodied human. Even our best approximation (3D VR headset with Spatial Audio) is a poor representation. We don’t even bother to simulate touch, temperature, sense of balance, etc. And the more detailed you get, the less data you have.

These senses can be described via text, but I’m highly skeptical that the learning outcomes will be the same.

valine 940 days ago

>> Text and 2D images are a tiny subset of physical reality as perceived by an able-bodied human. Even our best approximation is a poor representation.

This is wrong. There’s nothing magical about human perception. You see the world because a 2D image is projected onto your retina.

GPT-4 was trained on text and generalized the ability to output 2D images. There’s absolutely nothing to suggest text can’t generalize further to new modalities. GPT-4 is forced to serialize images as SVGs to output them (a crazy emergent ability btw), but that demonstrates an inherent spatial reasoning capability baked into the model.

GPT-4V was created with a transfer learning step where image embeddings are passed as input in place of text. That’s further evidence of the model’s ability to generalize to new modalities.

Everything you need to do multimodal input and output is already trained in. GPT-4V, I’m sure, is just the start.

foooorsyth 940 days ago

>GPT-4 was trained on text

And it shows. It has a poor grasp of reality. It does a poor job with complex tasks. It cannot be trusted with specialized tasks typically done by expert humans. It is certainly an amazing technical achievement that does a decent job with simple tasks requiring cursory knowledge, but that’s all it is at this time.

>There’s absolutely nothing to suggest text can’t generalize further to new modalities

Inversion of burden of proof.

valine 940 days ago

>> Inversion of burden of proof

Nope. OpenAI has already demonstrated the ability to generalize GPT-4 to a new modality. Your claim that text models can only generalize to images and not other modalities is utterly unconvincing. Explain to me why vision is so different from, say, audio.

>> And it shows. It has a poor grasp of reality. It does a poor job with complex tasks.

GPT-4 is a proof of concept more than anything. I’m excited to see how much reliability improves over time. Its grasp of reality isn’t perfect, but at least it understands how burden of proof works.

foooorsyth 940 days ago

>GPT-4 is a proof of concept more than anything

Hilarious walk-back. “Text can generalize anything” -> “It’s just a demo, bro” in the same post.

Lmao

valine 940 days ago

I walked back nothing. OpenAI was surprised by the mass adoption of ChatGPT; they saw it as an early technical preview.

I don’t understand why some people have such a hard time envisioning the potential of new technologies without a polished end product in their hands. Imagine if AI researchers had the same attitude.

Technology can be both real and unpolished at the same time. Those two things are not contradictory.
