Hacker News
by porphyra 440 days ago
ChatGPT 4o's advanced image generation seems to have a low-resolution autoregressive part that generates tokens directly, followed by an upscaling decoder step that turns the (perhaps 100 px wide) token image into the actual 1024 px wide final result. The former step almost nails things perfectly, but the latter step always changes things slightly. That's why it is so good at, say, generating large text but still struggles with fine text, and why it always introduces subtle variations when you ask it to edit an existing image.
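To put rough numbers on the large-text versus fine-text point, here is a back-of-the-envelope sketch. It takes the guessed 100 px token grid and the 1024 px output at face value; the function names and the "one token cell" threshold are assumptions for illustration, not anything known about the actual model.

```python
def pixels_per_token(final_px: int, token_grid_px: int) -> float:
    """Width, in final-image pixels, covered by one low-res token cell."""
    return final_px / token_grid_px


def survives_upscaling(feature_px: float,
                       final_px: int = 1024,
                       token_grid_px: int = 100) -> bool:
    """Crude heuristic (an assumption): the token stage can only pin down
    a feature that spans at least one full token cell. Anything smaller
    is left for the upscaling decoder to fill in."""
    return feature_px >= pixels_per_token(final_px, token_grid_px)


cell = pixels_per_token(1024, 100)     # about 10 px per token cell
headline_ok = survives_upscaling(40)   # a 40 px stroke spans several cells
fine_print_ok = survives_upscaling(6)  # a 6 px stroke is smaller than one cell
```

Under these assumed numbers each token covers roughly a 10 px square, so a headline glyph is constrained by several tokens while fine print falls below the grid and is effectively invented by the decoder.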
1 comment

Has anyone tried adding a model that selects the editing region before generation starts? Training data would probably be hard to come by, but existing image recognition tech that draws bounding rectangles might be a start.
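A minimal sketch of what that could look like on the output side, assuming the rectangle comes from some detector (not shown here) and that the regenerated image has the same dimensions as the original. `composite_edit` and the `Box` layout are hypothetical names for illustration; this is plain post-hoc compositing, not how any particular model works.

```python
from typing import List, Tuple

Image = List[List[int]]          # grayscale pixels, row-major
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def composite_edit(original: Image, edited: Image, box: Box) -> Image:
    """Copy pixels from `edited` into `original` only inside `box`.

    Everything outside the rectangle is taken verbatim from the original,
    so drift introduced by the decoder cannot leak out of the edit region.
    """
    if len(original) != len(edited) or any(
        len(a) != len(b) for a, b in zip(original, edited)
    ):
        raise ValueError("original and edited must have the same dimensions")

    left, top, right, bottom = box
    out = [row[:] for row in original]  # copy so the input is not mutated
    for y in range(max(top, 0), min(bottom, len(out))):
        for x in range(max(left, 0), min(right, len(out[y]))):
            out[y][x] = edited[y][x]
    return out


original = [[0] * 4 for _ in range(4)]
edited = [[9] * 4 for _ in range(4)]  # stand-in for a regenerated image that drifted everywhere
result = composite_edit(original, edited, (1, 1, 3, 3))
```

A hard rectangle like this leaves a visible seam if the edit changes lighting or color near the border, so a real pipeline would probably want a feathered mask instead.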