by Soerensen 135 days ago
Our biggest cost multiplier was "conversational drift" - not the initial call, but what happens when you let users iterate.

In our email marketing tool, a user might say "make it more punchy" → AI rewrites → "actually, more professional" → rewrite → "can we A/B test both versions?" → now you're generating multiple variants. One "simple" email could spiral into 15+ LLM calls.

What worked for us:

1. *Session-level budgets, not request-level.* We cap total tokens per session rather than per call. Users can iterate freely within their budget, but can't inadvertently 10x their usage.

2. *Explicit "done" signals.* Instead of letting users endlessly refine, we added a clear "I'm happy with this" button that closes the generation loop. Sounds UX-y but it cut our average calls-per-task by 60%.

3. *Cascade to cheaper models for iteration.* First generation uses Claude 3.5. Tweaks and refinements use Haiku. Users can't tell the difference for small edits, and it cut iteration costs ~80%.

4. *Cache aggressively at the semantic level.* "Make it shorter" and "condense this" should hit the same cache key. We use embeddings to identify semantically similar requests and serve cached results when possible.
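To make point 4 concrete, here is a rough sketch of an embedding-keyed cache. It is an illustration, not our production code: `SemanticCache`, the `draft_id` scoping, the 0.9 threshold, and the injected `embed` function are all assumptions. You would plug in whatever embedding model you use and tune the threshold on your own traffic. Scoping entries per draft matters, because "make it shorter" on a different email must not return the wrong rewrite.

```python
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Hypothetical cache keyed by (draft, meaning of the instruction)."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.9):
        self.embed = embed          # assumed: any text -> vector function
        self.threshold = threshold  # assumed cutoff; tune on real data
        self.entries: dict[str, list[tuple[Vector, str]]] = {}

    def get(self, draft_id: str, instruction: str) -> Optional[str]:
        """Return the cached result of the closest instruction, or None."""
        query = self.embed(instruction)
        best_score, best_result = 0.0, None
        for vec, result in self.entries.get(draft_id, []):
            score = cosine(query, vec)
            if score > best_score:
                best_score, best_result = score, result
        return best_result if best_score >= self.threshold else None

    def put(self, draft_id: str, instruction: str, result: str) -> None:
        """Store a result under the embedding of the instruction."""
        self.entries.setdefault(draft_id, []).append(
            (self.embed(instruction), result)
        )
```

A linear scan is fine for a per-draft cache with a handful of entries. At larger scale you would likely swap in a vector index.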

The counterintuitive insight: your biggest cost driver is probably user behavior, not model choice. The difference between GPT-4 and Claude matters less than how you architect the interaction loop.
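A minimal sketch of what that loop can look like, combining points 1 to 3. Everything here is illustrative: the `Session` class, the exception names, and the model identifiers are placeholders, and the token estimate is assumed to come from your own tokenizer or a rough heuristic.

```python
from dataclasses import dataclass

# Placeholder identifiers, not real model names.
FIRST_DRAFT_MODEL = "large-model"
REFINEMENT_MODEL = "small-model"


class BudgetExceeded(Exception):
    """Raised when a call would push the session past its token budget."""


class SessionClosed(Exception):
    """Raised when the user has already signalled they are done."""


@dataclass
class Session:
    token_budget: int      # cap for the whole session, not per call
    tokens_used: int = 0
    calls: int = 0
    done: bool = False

    def reserve(self, estimated_tokens: int) -> str:
        """Check the session before a call and return the model to use."""
        if self.done:
            raise SessionClosed("user already accepted a result")
        if self.tokens_used + estimated_tokens > self.token_budget:
            raise BudgetExceeded("session token budget exhausted")
        # First generation gets the larger model, refinements the cheaper one.
        return FIRST_DRAFT_MODEL if self.calls == 0 else REFINEMENT_MODEL

    def record(self, actual_tokens: int) -> None:
        """Account for a completed call using the provider's reported usage."""
        self.tokens_used += actual_tokens
        self.calls += 1

    def mark_done(self) -> None:
        """Wire this to the "I'm happy with this" button."""
        self.done = True
```

Usage is roughly: call `reserve()` before each LLM request, send the request to the returned model, then call `record()` with the token count the API reports. Separating the two lets you reject a request before paying for it while still billing the session with actual usage rather than the estimate.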