Whereas the benchmark gains seen in new OpenAI, Grok and Claude models don't feel accompanied by any vibe improvement