by valine
940 days ago
>> Text and 2D images are a tiny subset of physical reality as perceived by an able-bodied human. Even our best approximation is a poor representation.
This is wrong. There’s nothing magical about human perception. You see the world because a 2D image is projected onto your retina. GPT-4 was trained on text and generalized the ability to output 2D images. There’s absolutely nothing to suggest text can’t generalize further to new modalities. GPT-4 is forced to serialize images as SVGs to output them (a crazy emergent ability, btw), but that demonstrates an inherent spatial reasoning capability baked into the model. GPT-4V was created with a transfer learning step where image embeddings are passed as input in place of text. That’s further evidence of the models’ ability to generalize to new modalities. Everything you need to do multimodal input and output is already trained in; GPT-4V, I’m sure, is just the start.
And it shows. It has a poor grasp of reality. It does a poor job with complex tasks. It cannot be trusted with specialized tasks typically done by expert humans. It is certainly an amazing technical achievement that does a decent job with simple tasks requiring cursory knowledge, but that’s all it is at this time.
>There’s absolutely nothing to suggest text can’t generalize further to new modalities
Inversion of burden of proof.