by TimPC
1232 days ago
Why do you think multimodal transformers will get us anywhere near general-purpose AI? Multimodal transformers are basically a technology for learning sequence-to-sequence mappings, and it seems extremely unlikely to me that general intelligence is just one or more specific sequence-to-sequence mappings. Many special-purpose problems are sequence-to-sequence, but these tend to be specialized functions operating in one or more specific domains.
|
When a tool that is really a point solution "tricks" us into thinking it has replicated a task that requires complex, multi-component processing in our brains, we assume the tool works the way our brains do.
The joke, of course, is that if you maliciously edited GPT's index for translating vectors to words, it would produce gibberish and we wouldn't care, despite it being the exact same core model.
We are only impressed by the complex sequence-to-sequence strings it produces because the tokens happen to be words (arguably the most important things in our lives).
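To make the index point concrete, here is a toy sketch, not how GPT is actually implemented. The vocabulary, the token ids, and the helpers `decode` and `rotate_table` are all made up for illustration; the only claim is that the same ids read as a sentence or as nonsense depending purely on the lookup table.

```python
# Hypothetical toy vocabulary: position in the list is the token id.
vocab = ["the", "cat", "sat", "on", "a", "mat", "."]

# Pretend these ids are what the model emitted. The model never sees words,
# only ids; this list is an assumption standing in for real model output.
model_output_ids = [0, 1, 2, 3, 4, 5, 6]

def decode(ids, table):
    """Map token ids to strings using the given id -> word table."""
    return " ".join(table[i] for i in ids)

def rotate_table(table):
    """The 'malicious edit': shift every entry by one slot, so each id
    now points at a different word. The model itself is untouched."""
    return table[1:] + table[:1]

original_text = decode(model_output_ids, vocab)
scrambled_text = decode(model_output_ids, rotate_table(vocab))

print(original_text)   # the cat sat on a mat .
print(scrambled_text)  # cat sat on a mat . the
```

Same ids, same "model", and the only thing that changed is the table we use to read the output.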
EDIT: a great historical analogy for this is how we thought about 'computer vision' and CNNs. They do great at identifying things in images, but notice that we still use image-based CAPTCHAs (on OpenAI sites, no less!).
That's because it turns out optical illusions and context-heavy images are things CNNs really struggle with (since the problem space is bigger than 'how are these pixels arranged').