| Hey, I'm Mitch (post author). I'm a full-stack engineer with 15+ years of experience who contracts with Render, and I guest-posted my research on their blog. Until recently, I was very skeptical of AI coding tools; my AI usage was basically GH Copilot's autocomplete. After cleaning up too many agent mistakes in production, I set up a structured trial in my real projects to measure what these agents can do under real constraints. I ran two sets of experiments: vibe coding a new application from scratch as a control test, then giving the agents real production tasks. The production tasks ranged from backend work, like building a k8s pod leader election system in Go microservices, to building out CSS templates in Astro.js. I evaluated Cursor, Claude Code, Gemini CLI, and OpenAI Codex on setup friction, number of follow-up prompts, code quality, UX, and context handling. Cursor won, but it was close. I really liked the Claude Code UX and will try the new Cursor CLI. I plan to run a similar benchmark in the fall using newer features like parallel agents and newer models (maybe GPT-5 or whatever comes next). Let me know what I should test for round 2, or nitpick the criteria I used. The best tool for you might not be the best tool for me, so I encourage you to run your own experiments. I'm also happy to answer questions about my methodology. |