by survirtual 60 days ago
I use this prompt as my benchmark now, and I suggest you vary it to suit your own imagination:

"Make a single-page HTML file using threejs from a CDN. Render a scene of a flying dinosaur orbiting a planet. There are clouds with thunder and lightning, and the background is a beautiful starscape with twinkling stars and a colorful nebula"

This lets me evaluate several factors across models. The scene is novel and leaves room for creativity. I generally run it multiple times, though now that I have shared it here, I will come up with new scenes of my own to evaluate with.

I also consider how well it one-shots the task, the errors it generates, how it responds when those errors are pointed out, and how quickly iteration leads to improvement.
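
For anyone who wants to keep score across repeated runs, here is a minimal sketch of how those criteria could be tallied per model. Everything in it is an assumption for illustration: the `Run` fields, the `summarize` helper, and the placeholder model names are hypothetical, not anything I actually use. Creativity is left out on purpose, since that part stays an eyeball judgment.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    # All fields are hypothetical names chosen for this sketch.
    model: str
    one_shot: bool              # scene rendered correctly on the first attempt
    errors: int                 # errors seen in the browser console
    fixed_after_feedback: bool  # pasting the error back led to a working fix
    iterations: int             # rounds of prompting until the scene was acceptable


def summarize(runs):
    """Group runs by model and average each criterion."""
    by_model = {}
    for r in runs:
        by_model.setdefault(r.model, []).append(r)
    return {
        model: {
            "runs": len(rs),
            "one_shot_rate": mean([1.0 if r.one_shot else 0.0 for r in rs]),
            "mean_errors": mean([r.errors for r in rs]),
            "fix_rate": mean([1.0 if r.fixed_after_feedback else 0.0 for r in rs]),
            "mean_iterations": mean([r.iterations for r in rs]),
        }
        for model, rs in by_model.items()
    }


if __name__ == "__main__":
    # Placeholder entries only; these are not real results.
    demo = [
        Run("model-a", True, 0, True, 1),
        Run("model-a", False, 2, True, 3),
    ]
    print(summarize(demo))
```

Averaging over several runs matters here because a single generation can be a lucky or unlucky outlier.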

Generally speaking, Claude Sonnet has done the best, Qwen3.5 122B comes second, and I have had nice results from Qwen3.5 35B.

ChatGPT does not do well. It can complete the task without errors, but the creativity is atrocious.