|
|
|
|
|
by midlightdenight
1195 days ago
|
|
This has always made me curious about multi modality and especially Google’s Palm. At least how google presents Palm in diagrams. The way I’ve interpreted the scaling hypothesis is that we will see emergent intelligence with larger nets through automated training alone. If we want a model to learn images we throw a larger net at it with training data. The way I’ve interpreted some of these newer techniques to multi modality is that they are stitched together models. If we want a model to learn images we decide this and teach one model images and then connect it to the core model. There’s not a lot of emergent behavior due to scaling in this scenario. With that perception I don’t see how gpt4 says anything about the scaling hypothesis. However I am not in this field and would be grateful to learn more. |
|
So the goal should be: we've created a "language module" (LLMs) and a "visual perception module" (computer vision), but we also need to add a "logic module", a "reasoning module", an "empathy module", etc, while continuing to improve each.
I just don't see how you could get an LLM, no matter how advanced, to recognize a car. Even if it can describe cars (wheels, windshield, doors) it doesn't know what any of those components look like. It's like that old joke about philosophers being unable to define a chair beyond "I'll know it when I see it".