by throwaway314155 522 days ago
How would you prefer people talk about it? "Multimodal LLM"? My understanding is the vision portion is indeed wired directly to (and trained alongside) the language portion.

> give the appearance of agi

Can you point out where specifically they're doing this? Best I can tell, they give a decent summary of the effectiveness of multimodal LLMs with support for vision, and then talk about using one to solve an incredibly narrow task. The only wording I could see that hints at "agi" is where they describe the versatility of this approach, but how could you possibly argue against that? It's objectively more versatile (albeit wasteful and more expensive).

1 comments

I looked into the documentation and API, and it seems you are right: it is genuinely part of the GPT model. Of course, we cannot confirm that without the source code.

My understanding was that a traditional CV library was effectively converting the image to text before passing it to the LLM. But the more I think about it, even that method would involve training for image detection to the point where objects are recognized from images, not from tokens.

So the GPT product is no longer just an LLM, or purely text-based.

Can't say much for sure at this point with closed source. We will probably see the competition catch up eventually and have more info then, at which point OpenAI will eventually release the text2img separately and dispense with the mysticism and AGI pretension.

My guess is that this is a separate image-to-text model (or image+text model) that is bolted onto the main LLM code.

I don't think that text is just another modality; it will probably always be the core.

I don't have a source for something so strategic and subjective, I just have a finger on the pulse: their robot demo that does laundry, their consistent talk about AGI, their mention of power-seeking in docs, their attempt to raise trillions for chip factories, their transition to for-profit. They are under huge pressure to be THE monopoly, and their risk is that GPT turns out to be a text-based local maximum and that intelligence is not a Sapir-Whorfian phenomenon.

P.S.: early docs from 2023 refer to the img2txt submodel as GPT-4V; that's what we should call the submodel, in my opinion (if it is in fact the same piece of tech).