
I looked into the documentation and the API, and it seems you are right: it is genuinely part of the GPT model. Of course, we cannot confirm that without the source code.

My understanding was that a traditional CV library was effectively converting the image to text before passing it to the LLM. But the more I think about it, even that method would involve training for image detection to the point where objects are recognized from images, not from tokens.
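To make the pipeline I had in mind concrete, here is a rough sketch of "CV first, then LLM". Everything in it is hypothetical: `detect_objects` is a stub standing in for a trained detector, and the names, labels, and threshold are made up for illustration. It says nothing about how OpenAI actually does it.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    confidence: float


def detect_objects(image_bytes: bytes) -> list[Detection]:
    """Hypothetical stand-in for a traditional CV detector.

    A real detector would run a trained vision model on the image;
    this stub ignores its input and returns fixed detections.
    """
    return [
        Detection("dog", 0.93),
        Detection("frisbee", 0.81),
        Detection("tree", 0.40),
    ]


def detections_to_caption(detections: list[Detection], threshold: float = 0.5) -> str:
    """Turn detections into plain text, dropping low-confidence ones."""
    labels = [d.label for d in detections if d.confidence >= threshold]
    if not labels:
        return "Image contains: nothing recognized"
    return "Image contains: " + ", ".join(labels)


def build_prompt(caption: str, question: str) -> str:
    """The LLM only ever sees text: the caption plus the user's question."""
    return f"{caption}\nQuestion: {question}"


caption = detections_to_caption(detect_objects(b""))
prompt = build_prompt(caption, "What is the dog doing?")
print(prompt)
```

In this design the LLM never sees pixels, only the caption, so anything the detector misses is lost before the language model gets involved. And the detector itself still had to be trained on images.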

So the GPT product is no longer an LLM, or text-based.

Can't say much for sure at this point, since it is closed source. We will probably see the competition catch up eventually and have more info then. At that point OpenAI will eventually release the text2img separately and dispense with the mysticism and AGI pretension.

My guess is that this is a separate image-to-text model (or image+text model) and it is slapped onto the main LLM code.

I don't think text is just another modality; it will probably always be the core.

I don't have a source for something this strategic and subjective, I just have a finger on the pulse: their robot demo that does laundry, their consistent talk about AGI, their mention of power-seeking in the docs, their attempt to raise trillions for chip factories, the transition to for-profit. They are under huge pressure to be THE monopoly, and their risk is that GPT turns out to be a text-based local maximum and that intelligence is not a Sapir-Whorfian phenomenon.

P.S.: early docs from 2023 refer to the img2txt submodel as GPT-4V; that's what we should call the submodel, in my opinion. (If it is in fact the same piece of tech.)