by foooorsyth
940 days ago
> Text and 2D images are a tiny subset of physical reality as perceived by an able-bodied human. Even our best approximation (3D VR headset with Spatial Audio) is a poor representation. We don’t even bother to simulate touch, temperature, equilibrio-sense, etc. And the more detailed you get, the less data you have. These senses can be described via text, but I’m highly skeptical that the learning outcomes will be the same.

This is wrong. There’s nothing magical about human perception. You see the world because a 2D image is projected onto your retina.
GPT-4 was trained on text and generalized to outputting 2D images. There’s absolutely nothing to suggest text can’t generalize further to new modalities. GPT-4 is forced to serialize images as SVGs to output them (a crazy emergent ability btw), but that demonstrates an inherent spatial reasoning capability baked into the model.
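To make the SVG point concrete: an SVG is just text that describes geometry, so emitting one is a matter of getting coordinates and sizes right in a string. Here is a minimal sketch in Python; `circle_svg` is a made-up helper for illustration, not anything a model actually runs.

```python
def circle_svg(cx, cy, r, size=100):
    """Return a minimal SVG document, as a plain string, that draws one filled circle."""
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">'
        f'<circle cx="{cx}" cy="{cy}" r="{r}" fill="black"/>'
        "</svg>"
    )

# The "image" is nothing more than characters; a renderer turns them into pixels.
svg_text = circle_svg(50, 50, 20)
print(svg_text)
```

Placing the circle in the middle of the canvas means picking `cx` and `cy` relative to `size`, which is the kind of spatial bookkeeping a text-only model has to get right to draw anything recognizable.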
GPT-4V was created with a transfer learning step where image embeddings are passed as input in place of text. That’s further evidence of the model’s ability to generalize to new modalities.
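GPT-4V's architecture isn't public, so take this as a rough sketch of the general idea as it appears in open vision-language models, not as how OpenAI did it. The image is cut into patches, each patch is linearly projected to the same width as the text token embeddings, and the language model then sees one sequence of vectors. The sizes `D_MODEL` and `PATCH` and the random weights are all assumptions for illustration; in a real system the projection is learned.

```python
import random

def patchify(image, patch):
    """Split an H x W grid of pixel values into flattened patch x patch tiles."""
    h, w = len(image), len(image[0])
    return [
        [image[r + dr][c + dc] for dr in range(patch) for dc in range(patch)]
        for r in range(0, h, patch)
        for c in range(0, w, patch)
    ]

def project(vec, weights):
    """Linear map: one output value per row of `weights`."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

rng = random.Random(0)
D_MODEL = 8  # hypothetical embedding width of the language model
PATCH = 2    # hypothetical patch size

# Toy 4x4 grayscale "image".
image = [[rng.random() for _ in range(4)] for _ in range(4)]

# Projection from flattened patch to embedding space (random here, learned in practice).
W_img = [[rng.gauss(0, 1) for _ in range(PATCH * PATCH)] for _ in range(D_MODEL)]
image_embeddings = [project(p, W_img) for p in patchify(image, PATCH)]

# Stand-ins for the embeddings of three text tokens.
text_embeddings = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(3)]

# The model receives a single sequence of same-width vectors, whatever their origin.
sequence = image_embeddings + text_embeddings
print(len(sequence), len(sequence[0]))  # 7 8
```

Once the patches live in the same vector space as the tokens, nothing downstream needs to know which positions came from pixels and which from text.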
Everything you need to do multimodal input and output is already trained in. GPT-4V, I’m sure, is just the start.