Hacker News
by simonw 40 days ago
I believe it's an evolution of the technique used in GPT-Image-1 (or whatever they called that), which was derived from their work on making GPT-4o an "omni" model that can directly output images and audio in addition to text.

The 2024 GPT-4o launch post https://openai.com/index/hello-gpt-4o/ hints at how that works:

"With GPT‑4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network."

1 comment

Yeah, that's my belief as well, but I haven't seen any concrete explanations of how it works, just the marketing/press releases, sadly.