I've been pretty satisfied using oh-my-openagent (omo) on opencode with both opus-4.6 and gpt-5.4 lately.
The author of omo suggests different prompting strategies for different models and goes into some detail here.
https://github.com/code-yeongyu/oh-my-openagent/blob/dev/doc...
For each agent they define, they tailor the prompt to whichever model is being used.
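To make that concrete, here's a rough sketch of what per-model prompt selection could look like. This is not omo's actual code: the `PROMPT_VARIANTS` table, the prompt wording, and the `prompt_for` helper are all made up for illustration.

```python
# Hypothetical prompt variants keyed by model-family prefix.
# The names and wording are illustrative, not taken from omo.
PROMPT_VARIANTS = {
    "opus": "You are a planning agent. Weigh the trade-offs before acting.",
    "gpt": "You are a planning agent. Follow the numbered steps in order.",
}
DEFAULT_PROMPT = "You are a planning agent."


def prompt_for(model: str) -> str:
    """Return the prompt variant whose family prefix matches the model id."""
    name = model.lower()
    for family, prompt in PROMPT_VARIANTS.items():
        if name.startswith(family):
            return prompt
    # Unknown model family: fall back to a generic prompt.
    return DEFAULT_PROMPT


print(prompt_for("opus-4.6"))
print(prompt_for("gpt-5.4"))
```

The point is only that the agent definition stays the same while the wording changes per model family.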
I wonder how many of the "x did worse than y for the same prompt" tests would come out differently if the prompts were actually tailored to what each model is good at.
I also wonder if any of this matters or if it's all a crock of bologna.
I think it may matter a good bit. I definitely have to write in different styles with different models (and catch myself doing so unintentionally), now that you mention it...
Fwiw I run this eval every week on a set of known prompts, and I believe the within-model differences are bigger than the between-model differences.
That is, I get more run-to-run variance between opus 4.6 and itself than I do between the SOTA models.
I don’t have the budget for statistical significance, but I’m convinced that people claiming broad differences are either just vibing or hitting cases where agent features make a big difference.
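For anyone who wants to check the same thing on their own evals, a minimal sketch of the within-vs-between comparison is below. It assumes you have one numeric score per run per model; the `within_and_between` helper and the numbers are made up for illustration and are not results from the eval described above.

```python
from statistics import mean, pvariance


def within_and_between(scores):
    """scores maps model name -> list of per-run eval scores.

    Returns (average within-model variance, variance of the per-model means).
    """
    within = mean(pvariance(runs) for runs in scores.values())
    between = pvariance([mean(runs) for runs in scores.values()])
    return within, between


# Made-up scores, purely illustrative.
example = {
    "model_a": [0.60, 0.80, 0.70],
    "model_b": [0.65, 0.75, 0.70],
}

within, between = within_and_between(example)
print(f"within-model variance:  {within:.4f}")
print(f"between-model variance: {between:.4f}")
```

If the within number is as large as or larger than the between number, the run-to-run noise swamps the model-to-model gap, and you'd need many more runs before calling one model better.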