I'm Ben. I built UCP Playground (https://ucpplayground.com), the testing tool mentioned in the post. I can confirm the Saturday night debugging session. We were chatting while patching things live on both sides for hours. A few things I can add from the Playground side:

On schema quality > model capability: This is the biggest takeaway from 180+ sessions across 11 models. We have a leaderboard tracking completion rates per model, and the variance between stores is larger than the variance between models. A clean store schema makes even weaker models succeed. A messy schema makes even Claude and GPT fail. The store is the bottleneck, not the LLM.

On variant resolution: The agents that skip get_product_details and try to cart directly almost always fail. They guess variant IDs from search results, hit type errors, and drop off. Llama 3.3 70B's higher success rate comes down to it reliably calling the details endpoint first: it follows the tool sequence more faithfully than models that try to shortcut.

On the three transports: UCP Playground tests MCP, REST, and ECP (embedded checkout via iframe). The REST fallback turned out to be critical: MCP connections to WooCommerce stores can be flaky (OOM on unbounded queries, rate limits), so the agent silently falls back to REST for that tool call and continues. Without that, a lot of sessions would just die at search.

On store instructions: One thing we surface in Playground that merchants don't always realize is that stores can inject response_instructions into MCP tool responses. These are behavioral prompts that tell the agent how to present products. We extract and display them per tool call. It's worth auditing what your store is telling agents to do.

If anyone wants to see what a real agent session looks like, Playground lets you share replays as public links. Happy to answer questions about the testing/observability side.
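
To make the variant resolution point concrete, here is a rough sketch of the details-first sequence. It is not Playground's code or any store's real API: only get_product_details is named above, while the `add_to_cart` tool, the `call_tool` callable, the response shapes, and the in-memory `fake_store` are assumptions made up for illustration.

```python
from typing import Any, Callable, Dict

# Hypothetical signature: (tool name, arguments) -> parsed tool response.
ToolCall = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def add_to_cart_safely(call_tool: ToolCall, product_id: str,
                       wanted: Dict[str, str], quantity: int = 1) -> Dict[str, Any]:
    """Look up variants via get_product_details, then cart the matching one."""
    details = call_tool("get_product_details", {"product_id": product_id})
    for variant in details.get("variants", []):
        options = variant.get("options", {})
        # A variant matches when every requested option has the requested value.
        if all(options.get(key) == value for key, value in wanted.items()):
            return call_tool("add_to_cart",
                             {"variant_id": variant["id"], "quantity": quantity})
    raise LookupError(f"no variant of {product_id} matches {wanted}")


def fake_store(tool: str, args: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical in-memory store standing in for a real transport."""
    if tool == "get_product_details":
        return {"id": args["product_id"], "variants": [
            {"id": 101, "options": {"size": "M", "color": "blue"}},
            {"id": 102, "options": {"size": "L", "color": "blue"}},
        ]}
    if tool == "add_to_cart":
        # Mimics the type errors agents hit when they guess variant IDs.
        if not isinstance(args["variant_id"], int):
            raise TypeError("variant_id must be an integer")
        return {"cart": [{"variant_id": args["variant_id"],
                          "quantity": args["quantity"]}]}
    raise ValueError(f"unknown tool: {tool}")
```

Calling `add_to_cart_safely(fake_store, "tee-1", {"size": "L"})` carts the variant whose ID came from the details response, so the agent never has to guess an ID from search results.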
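
The per-tool-call fallback from MCP to REST could look roughly like the sketch below. Everything here is an assumption for illustration: the `TransportError` type, the transport callables, and the `search_products` tool name in the usage note are invented, and real MCP and REST clients would replace the stubs.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical transport signature: (tool name, arguments) -> parsed response.
Transport = Callable[[str, Dict[str, Any]], Dict[str, Any]]


class TransportError(Exception):
    """Hypothetical error for a failed call (timeout, OOM, rate limit)."""


def call_with_fallback(tool: str, args: Dict[str, Any],
                       transports: List[Tuple[str, Transport]]
                       ) -> Tuple[str, Dict[str, Any]]:
    """Try each transport in order and return (transport name, response)."""
    errors = []
    for name, transport in transports:
        try:
            return name, transport(tool, args)
        except TransportError as exc:
            # Record the failure and move on to the next transport.
            errors.append(f"{name}: {exc}")
    raise TransportError("all transports failed: " + "; ".join(errors))


def flaky_mcp(tool: str, args: Dict[str, Any]) -> Dict[str, Any]:
    """Stub MCP transport that always fails, as an overloaded store might."""
    raise TransportError("upstream ran out of memory")


def rest(tool: str, args: Dict[str, Any]) -> Dict[str, Any]:
    """Stub REST transport that answers the same tool call."""
    return {"tool": tool, "results": ["blue tee"]}
```

With `call_with_fallback("search_products", {"query": "tee"}, [("mcp", flaky_mcp), ("rest", rest)])`, the session continues on REST. Returning the transport name alongside the response keeps the fallback silent for the agent but still visible in a replay or log.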
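
For the store instructions audit, a merchant could pull response_instructions out of tool responses with something like the sketch below. Where the field sits in a real response is an assumption here, so the helper walks the whole parsed JSON payload; `find_response_instructions` and the sample payload are hypothetical, and this is not how Playground itself extracts them.

```python
from typing import Any, List


def find_response_instructions(payload: Any) -> List[str]:
    """Collect every string stored under a 'response_instructions' key."""
    found: List[str] = []
    if isinstance(payload, dict):
        for key, value in payload.items():
            if key == "response_instructions" and isinstance(value, str):
                found.append(value)
            else:
                # Keep searching nested objects and arrays.
                found.extend(find_response_instructions(value))
    elif isinstance(payload, list):
        for item in payload:
            found.extend(find_response_instructions(item))
    return found


# Hypothetical parsed tool response, invented for this example.
sample = {
    "products": [{"id": 1, "name": "Blue tee"}],
    "response_instructions": "Always show the sale price first.",
}
```

Running `find_response_instructions(sample)` on each tool response gives a plain list of the prompts a store is sending to agents, which is enough to review them by hand.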