| Our biggest cost multiplier was "conversational drift" - not the initial call, but what happens when you let users iterate. In our email marketing tool, a user might say "make it more punchy" → AI rewrites → "actually, more professional" → rewrite → "can we A/B test both versions?" → now you're generating multiple variants. One "simple" email could spiral into 15+ LLM calls. What worked for us: 1. *Session-level budgets, not request-level.* We cap total tokens per session rather than per call. Users can iterate freely within their budget, but can't inadvertently 10x their usage. 2. *Explicit "done" signals.* Instead of letting users endlessly refine, we added a clear "I'm happy with this" button that closes the generation loop. Sounds UX-y but it cut our average calls-per-task by 60%. 3. *Cascade to cheaper models for iteration.* First generation uses Claude 3.5. Tweaks and refinements use Haiku. Users can't tell the difference for small edits, and it cut iteration costs ~80%. 4. *Cache aggressively at the semantic level.* "Make it shorter" and "condense this" should hit the same cache key. We use embeddings to identify semantically similar requests and serve cached results when possible. The counterintuitive insight: your biggest cost driver is probably user behavior, not model choice. The difference between GPT-4 and Claude matters less than how you architect the interaction loop. |