by gslepak
24 days ago
On page 102 of the system card [1] I'm pleased to see evaluation against "creative mastery". In our work we asked several frontier AIs to come up with an API we needed. We compared Opus 4.7 and GPT-5.5 (among others). Opus 4.7 came up with the most creative and intelligent API design, which pleasantly surprised us, especially given that GPT-5.5 was passing it on various coding benchmarks. What I noticed is that we don't have a common benchmark to measure "creativity" and "ingenuity", and in some ways such a benchmark would conflict with the widely used IFBench benchmark. Yet this is a very important skill when designing systems. I'm glad to see Anthropic putting thought into it, and would love to see a public benchmark for this that other models could compare themselves to.

[1] https://cdn.sanity.io/files/4zrzovbb/website/c886650a2e96fc0...