| Wow, I love this benchmark. I've been doing something similar (as a joke, and much less frequently), where I ask multiple models to attempt to create a data structure like:
```
const melody = [
  { freq: 261.63, duration: 'quarter' }, // C4
  { freq: 0, duration: 'triplet' },      // triplet rest
  { freq: 293.66, duration: 'triplet' }, // D4
  { freq: 0, duration: 'triplet' },      // triplet rest
  { freq: 329.63, duration: 'half' },    // E4
];
```
But with the intro to Smoke on the Water by Deep Purple. Then I run it through the Web Audio API and see how it sounds.

It's never quite gotten it right, but it's gotten better, to the point where I can ask it to make a website that can play it.

I think yours is a lot more thoughtful about testing novelty, but it's interesting to see them attempt to do things that they aren't really built for (in theory!).

https://codepen.io/mvattuone/pen/qEdPaoW - ChatGPT 4 Turbo
https://codepen.io/mvattuone/pen/ogXGzdg - Claude Sonnet 3.7
https://codepen.io/mvattuone/pen/ZYGXpom - Gemini 2.5 Pro

Gemini is by far the best sounding one, but it's still off. I'd be curious how the latest and greatest (paid) versions fare.

(And just for comparison, here's the first time I did it... you can tell I did the front-end because there isn't much to it!) https://nitter.space/mvattuone/status/1646610228748730368#m |
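For anyone who wants to try this, here is a rough sketch of how a `melody` array like the one above could be played through the Web Audio API. It is not the code from the pens linked above. The `BEATS` table, the `toSchedule` and `play` names, the 120 bpm default, and the reading of `'triplet'` as one third of a beat are all assumptions made for illustration.

```javascript
// Assumed mapping from duration names to beats (quarter note = 1 beat).
// 'triplet' is treated as an eighth-note triplet (1/3 of a beat); the
// original data structure does not say which kind of triplet is meant.
const BEATS = { whole: 4, half: 2, quarter: 1, eighth: 0.5, triplet: 1 / 3 };

// Convert the melody into absolute start times and lengths, in seconds.
function toSchedule(melody, bpm = 120) {
  const secondsPerBeat = 60 / bpm;
  let t = 0;
  return melody.map(({ freq, duration }) => {
    const beats = BEATS[duration];
    if (beats === undefined) throw new Error(`Unknown duration: ${duration}`);
    const event = { freq, start: t, length: beats * secondsPerBeat };
    t += event.length;
    return event;
  });
}

// Play the schedule with one oscillator per note; freq 0 is a rest.
function play(ctx, melody, bpm = 120) {
  const t0 = ctx.currentTime + 0.1; // small lead time before the first note
  for (const { freq, start, length } of toSchedule(melody, bpm)) {
    if (freq === 0) continue; // rest: schedule nothing
    const osc = ctx.createOscillator();
    const gain = ctx.createGain();
    osc.type = 'square';
    osc.frequency.value = freq;
    // Linear fade over the note's length so it does not end with a click.
    gain.gain.setValueAtTime(0.2, t0 + start);
    gain.gain.linearRampToValueAtTime(0, t0 + start + length);
    osc.connect(gain).connect(ctx.destination);
    osc.start(t0 + start);
    osc.stop(t0 + start + length);
  }
}
```

In a browser, something like `play(new AudioContext(), melody)` would need to be called from a click handler, since most browsers will not start audio without a user gesture. Keeping `toSchedule` separate from `play` also makes it easy to compare the timing each model produced without listening to it.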
Also, if it becomes popular, training sets may pick it up and improve models unfairly and unrealistically. But that's true of any known benchmark.
Side note: I'd really like to see the Computer Language Benchmarks Game become a prompt-based languages × models benchmark game, so we could say model X excels at Python fasta, etc. The risk, again, is that it ends up in training sets and the whole thing rigs itself.