Hacker News
by TimPC 1232 days ago
Why do you think multimodal transformers will get us anywhere near general-purpose AI? Multimodal transformers are basically a technology for intelligent sequence-to-sequence mappings, and it seems extremely unlikely to me that general intelligence is one or more specific sequence-to-sequence mappings. Many special-purpose problems are sequence-to-sequence, but these tend to be specialized functions operating in one or more specific domains.

A lot of people don't really get that our brains are a bunch of specialized subcomponents working in concert (your prefrontal cortex simply cannot make your heart beat, no matter how optimized it gets). This is unsurprising, since our brains are among the most complex and hardest-to-monitor things on Earth.

When an artificial tool that is really a point solution "tricks" us into thinking it has replicated a task that requires complex, multi-component functioning in our brain, we assume the tool works the way our brain does.

The joke, of course, is that if you maliciously edited GPT's index for translating vectors into words, it would produce gibberish and we wouldn't care (despite it being the exact same core model).

We are only impressed by the complex sequence-to-sequence strings it produces because the tokens happen to be words (arguably the most important things in our lives).
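A toy sketch of the "edit the index" point, assuming a language model only emits integer token ids and a separate lookup table turns those ids into text. The tiny vocabulary, the fixed id sequence standing in for model output, and the helper names `decode` and `scramble_vocab` are all hypothetical; real GPT tokenizers map ids to subword pieces from a much larger vocabulary.

```python
import random

# Hypothetical toy vocabulary: token id -> string.
# Real GPT tokenizers map ids to subword pieces, not whole words.
VOCAB = ["the", "cat", "sat", "on", "mat", "a", "dog", "ran"]


def decode(token_ids, vocab):
    """Turn a sequence of token ids into text using the given lookup table."""
    return " ".join(vocab[i] for i in token_ids)


def scramble_vocab(vocab, seed=0):
    """Return a randomly permuted copy of the id -> string table.

    The "model" is untouched; only the final lookup changes.
    """
    shuffled = list(vocab)
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Stand-in for what a model emits: a fixed sequence of token ids.
model_output_ids = [0, 1, 2, 3, 0, 4]

print(decode(model_output_ids, VOCAB))                  # the cat sat on the mat
print(decode(model_output_ids, scramble_vocab(VOCAB)))  # same ids, shuffled words
```

In both calls the id sequence, which is everything the hypothetical model actually computed, is identical; only the human-facing table differs.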

EDIT: a great historical metaphor for this is how we thought about 'computer vision' and CNNs. They do great at identifying things in images, but notice that we still use image-based CAPTCHAs (even on OpenAI's own sites, no less!).

That's because it turns out that optical illusions and context-heavy images are things CNNs really struggle with (since the problem space is bigger than 'how are these pixels arranged').

A couple of things.

1) As I said, many people have different ideas of what we are talking about. I assume that for you, general-purpose AI has more capabilities, such as the ability to quickly learn tasks to a high level on the fly. For me, it still qualifies as general-purpose if it can do most tasks but relies on a lot of pre-training and, let's say, knowledge-base lookup.

2) It seems obvious to me that ChatGPT demonstrates general-purpose utility for these types of LLMs, and it is easy to speculate that something similar but with visual input/output would be even more general. So by that definition, we are just looking at a matter of degree.

For 1), I agree, but ChatGPT is a special-purpose sequence-to-sequence model. It's fairly obvious to me that it's not general-purpose, and it sometimes even fails at correctly reading content it generated itself. It also doesn't understand correctness and often ends up generating incorrect content. Our best example of it not being general-purpose is how staggeringly bad ChatGPT is at math, which is blatantly obvious when you think about how it is designed.