We can ask the vision-based models to output the reasoning behind each action they take, and then fall back to code-based approaches for subsequent runs.
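
One way this could look in practice is sketched below. It is a minimal illustration, not a definitive design: `Step`, `Runner`, and `fake_vision_model` are hypothetical names, and the model call is a stub standing in for whatever vision model is actually used. The first run asks the model for actions plus a rationale for each; later runs replay the recorded steps in plain code without calling the model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    action: str     # what to do, e.g. "click"
    target: str     # what to do it to, e.g. a selector (assumed format)
    rationale: str  # the model's stated reason for this step


def fake_vision_model(task: str) -> List[Step]:
    """Hypothetical stub for a vision model that returns steps with reasons."""
    return [
        Step("click", "#login", "The task needs a signed-in session."),
        Step("type", "#search", "The search box is where the query goes."),
    ]


class Runner:
    """Uses the vision model once per task, then replays recorded steps."""

    def __init__(
        self,
        vision_model: Callable[[str], List[Step]],
        execute: Callable[[Step], None],
    ) -> None:
        self.vision_model = vision_model
        self.execute = execute
        self.recorded: Dict[str, List[Step]] = {}
        self.model_calls = 0

    def run(self, task: str) -> str:
        if task in self.recorded:
            # Subsequent run: code-based replay, no model call.
            steps = self.recorded[task]
            source = "code"
        else:
            # First run: ask the model for steps and their rationale.
            steps = self.vision_model(task)
            self.model_calls += 1
            self.recorded[task] = steps
            source = "vision"
        for step in steps:
            self.execute(step)
        return source
```

Keeping the rationale next to each recorded step means the replayed script can still be reviewed later, since each action carries the reason the model gave for it.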