by AnimalMuppet
553 days ago
> I don't think it'd be fully correct to say that knowledge is only encoded by relations between words. The input/output of the model is tokens of text, but internally it'll be converted into high-dimensional semantic vector spaces of concepts.

All right, how about this: LLMs do have actual knowledge - the knowledge that was encoded in the words in the training data. That's not how they store the data internally, but the actual knowledge comes from there. And I wasn't saying that that's enough. I was saying that the LLM advocates think, or at least claim, that it's enough.

For non-multimodal models, and minus ephemeral context and what's encoded by the architecture (like the translational invariance of CNNs), I'd agree to that.
> And I wasn't saying that that's enough. I was saying that the LLM advocates think, or at least claim, that it's enough.
Most modern LLMs like GPT-4, LLaMA-3.2, Gemini, or Claude 3.5 are already multimodal (text, images, sometimes video, sometimes audio). If you mainly meant that multimodality is a good pathway to building richer internal world representations (and thus to answering questions involving 3D geometry better, for instance), then I'd also agree there, though I don't see why it'd be a requirement for reasoning etc., as opposed to just beneficial.