> My biggest gripe is that he's comparing probabilistic models (LLMs) by a single sample. You wouldn't compare different random number generators by taking one sample from each and then concluding that generator 5 generates the highest numbers... Would be nicer to run the comparison with 10 images (or more) for each LLM and then average.
I've been considering an expanded version of this where each model outputs ten images, then a vision model helps pick the "best" of those to represent that model in a further competition with other models.
(Then I would also expand the judging panel to three vision LLMs from different model families which vote on each round... partly because it will be interesting to track cases where the judges disagree.)
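A rough sketch of that setup, purely as illustration: each "judge" here is a hypothetical callable standing in for a vision LLM call, images are represented as plain strings, and the names `majority_vote`, `pick_best`, and `run_tournament` are made up for this example. It also assumes the same three-judge panel picks each model's best image, and that the models' champions then face each other in a simple round-robin.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Dict, List, Tuple

# Hypothetical judge interface: given two images, return 0 if the first
# is preferred and 1 if the second is. A real judge would call a vision LLM.
Judge = Callable[[str, str], int]


def majority_vote(judges: List[Judge], a: str, b: str) -> Tuple[int, bool]:
    """Return (winning index, whether the panel was unanimous).

    Assumes an odd number of judges so a binary vote cannot tie.
    """
    votes = [judge(a, b) for judge in judges]
    winner = Counter(votes).most_common(1)[0][0]
    return winner, len(set(votes)) == 1


def pick_best(candidates: List[str], judges: List[Judge]) -> str:
    """Pick one representative image by running the current best against each challenger."""
    best = candidates[0]
    for challenger in candidates[1:]:
        winner, _ = majority_vote(judges, best, challenger)
        if winner == 1:
            best = challenger
    return best


def run_tournament(
    samples: Dict[str, List[str]], judges: List[Judge]
) -> Tuple[Counter, List[Tuple[str, str]]]:
    """Round-robin between each model's champion image.

    Returns per-model win counts and the list of match-ups where
    the judges did not all agree.
    """
    champions = {model: pick_best(images, judges) for model, images in samples.items()}
    wins = Counter({model: 0 for model in champions})
    disagreements = []
    for m1, m2 in combinations(champions, 2):
        winner, unanimous = majority_vote(judges, champions[m1], champions[m2])
        wins[m1 if winner == 0 else m2] += 1
        if not unanimous:
            disagreements.append((m1, m2))
    return wins, disagreements
```

The `disagreements` list is the part that makes split decisions easy to track afterwards; swapping the round-robin for a bracket or an Elo-style rating would only change `run_tournament`.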
I'm not sure it's worth doing, though, since the whole "benchmark" is pretty silly. I'm on the fence.