This shows some gaps in the "same prompt to every model" approach to benchmarking models. I get that it ensures you're testing model capabilities rather than prompts, but most models are post-trained on very different prompting formats.

I use Seedream in production, so I was a little suspicious of the gap. I passed ByteDance's official prompting guide, OP's prompt, and your feedback to Claude Opus 4.5 and got this prompt to create a new image:

> A partially eaten chicken burrito with a bite taken out, revealing the fillings inside: shredded cheese, sour cream, guacamole, shredded lettuce, salsa, and pinto beans all visible in the cross-section of the burrito. Flour tortilla with grill marks. Taken with a cheap Android phone camera under harsh cafeteria lighting. Compostable paper plate, plastic fork, messy table. Casual unedited snapshot, slightly overexposed, flat colors.

Then I generated with n=4 and the 'standard' prompt expansion setting for Seedream 4.0 Text To Image: https://imgur.com/a/lxKyvlm

They're still not perfect (it's not adhering to the fillings being inside, for example), but it's massively better than OP's result.

Shows that a) random chance plays a big part, so you want more than 1 sample, and b) you don't have to "cheat" by spending massive amounts of time hand-iterating on a single prompt to get a better result.
Including "total rolls" as a metric is very valuable, since it helps indicate how steerable the model is.
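To make the "more than 1 sample" and "total rolls" points concrete, here's a rough sketch of the arithmetic. It assumes each roll is independent with a fixed chance of producing an acceptable image, which is a simplification, and the pass/fail judgments below are made-up illustration values, not results from this thread. The function names are hypothetical too.

```python
def pass_rate(results):
    """Fraction of samples judged acceptable (results is a list of bools)."""
    if not results:
        raise ValueError("need at least one sample")
    return sum(results) / len(results)


def chance_of_at_least_one(p, n):
    """Probability that at least one of n independent rolls is acceptable."""
    return 1 - (1 - p) ** n


def expected_rolls(p):
    """Expected number of rolls until the first acceptable image."""
    if p <= 0:
        return float("inf")
    return 1 / p


# Hypothetical judgments for an n=4 batch (NOT real data from the thread).
judged = [True, False, False, True]

p = pass_rate(judged)
print(f"estimated pass rate: {p:.2f}")
print(f"chance of >=1 good image in 4 rolls: {chance_of_at_least_one(p, 4):.4f}")
print(f"expected rolls to first good image: {expected_rolls(p):.1f}")
```

With only 4 samples the estimated pass rate is very noisy, which is another reason a single roll per model says little on its own.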