Ask HN: How do you personally evaluate new LLM models?

I have some prompts I've saved and have been expanding as needed. I have, say, a dozen key features and a bunch of rules that need to be implemented. Not much is left for the models to imagine. Then they need to get coding.

I also have one seat-of-my-pants test: 'give me a story', themed around whatever my kid likes lately.

Overall, from my testing, the good players like Claude get it correct on the first go. Amazing. But I don't mind giving feedback; what matters is how many times I need to re-correct it. qwen-coder needed an excessive number of corrections.