by vunderba
104 days ago
Yeah, I think that's part of the issue with a single "squashed" comparative metric. Some users are going to grade higher based on overall visual fidelity, and others are going to value following the prompt. For a point of reference, I run a pretty comprehensive image model comparison site that is heavily weighted in favor of prompt adherence: https://genai-showdown.specr.net

EDIT: FWIW, I agree with your assessment. OpenAI's models have always been very strong in prompt adherence but visually weak (gpt-image-1 had the famous "piss filter" until they finally pushed out gpt-image-1.5).
|
Did you manually review all the edit results yourself, or do you have some kind of automated procedure?