In theory, yes, we could, but would it yield "good enough" results for a "testing" agent? Probably not. The LLM here is not only responsible for tool calling; it also handles more intricate work, such as planning the next steps based on the input feature file and generating the browser/API automation code. In our experiments, we found that OpenAI's GPT-4o performs best, followed by Haiku or Grok.
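
To make the point about planning and code generation concrete, here is a rough sketch of what the model is asked to do per step. It is not our actual implementation: it assumes the feature file is Gherkin-style, and `extract_steps`, `run_feature`, `fake_llm`, and the prompts are hypothetical names made up for illustration. The model is a plain callable so you can swap in any provider.

```python
from typing import Callable, List

# Hypothetical: any function that maps a prompt string to a model completion.
LLM = Callable[[str], str]

# Assumption: the feature file uses Gherkin-style step keywords.
STEP_KEYWORDS = ("Given", "When", "Then", "And", "But")


def extract_steps(feature_text: str) -> List[str]:
    """Return the step lines (Given/When/Then/...) from a feature file."""
    steps = []
    for line in feature_text.splitlines():
        stripped = line.strip()
        if stripped.split(" ", 1)[0] in STEP_KEYWORDS:
            steps.append(stripped)
    return steps


def run_feature(feature_text: str, llm: LLM) -> List[str]:
    """For each step, ask the model for a plan, then for automation code."""
    generated = []
    for step in extract_steps(feature_text):
        # Two model calls per step: this is the work beyond plain tool calling.
        plan = llm(f"Plan how to automate this step: {step}")
        code = llm(f"Write browser/API automation code for this plan: {plan}")
        generated.append(code)
    return generated


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs offline."""
    return f"<model output for: {prompt[:40]}>"


feature = """
Feature: Login
  Scenario: Valid user
    Given the login page is open
    When the user submits valid credentials
    Then the dashboard is shown
"""

for snippet in run_feature(feature, fake_llm):
    print(snippet)
```

Even in this toy version, every step goes through a planning call and a code-generation call, so the quality of both depends directly on how capable the model is.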