You're running RAG in production. Then the AWS bill lands: $2,400/month for 50 queries/day. That's $48 a month for every daily query, or roughly $1.60 per query. We built a RAG system for enterprise clients and realized most production RAG systems are optimization disasters. The literature obsesses over accuracy while completely ignoring unit economics.
The Three Cost Buckets
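To put rough dollar figures on the buckets, here is a back-of-the-envelope split of the $2,400 bill using the percentage ranges given in this section. The 30-day month and the function names are our own assumptions for illustration, not part of any billing API.

```python
# Back-of-the-envelope unit economics for the bill quoted in the post.
MONTHLY_BILL = 2400.0   # USD, from the post
QUERIES_PER_DAY = 50    # from the post
DAYS_PER_MONTH = 30     # assumption: a 30-day billing month

# (low, high) share of the total bill per bucket, from the post.
BUCKETS = {
    "vector_db": (0.40, 0.50),
    "llm_api": (0.30, 0.40),
    "infrastructure": (0.15, 0.25),
}


def cost_per_query(monthly_bill, queries_per_day, days=DAYS_PER_MONTH):
    """Monthly bill divided by the number of queries served that month."""
    return monthly_bill / (queries_per_day * days)


def bucket_ranges(monthly_bill, buckets=BUCKETS):
    """Dollar range (low, high) for each cost bucket."""
    return {
        name: (round(monthly_bill * low, 2), round(monthly_bill * high, 2))
        for name, (low, high) in buckets.items()
    }


if __name__ == "__main__":
    print(f"per query: ${cost_per_query(MONTHLY_BILL, QUERIES_PER_DAY):.2f}")
    for name, (low, high) in bucket_ranges(MONTHLY_BILL).items():
        print(f"{name}: ${low:,.0f} to ${high:,.0f}")
```

Dividing by monthly volume (queries per day times days) rather than by daily volume is what gives a true per-query figure.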
Vector Database (40-50% of bill)
Standard RAG pipelines do 3-5 unnecessary DB queries per question. We were making 5 round-trips for what should've been 1.5.
LLM API (30-40%)
Standard RAG pumps 8-15k tokens into the LLM. That's 5-10x more than necessary. We found: beyond 3,000 tokens of context, accuracy plateaus. Everything beyond that is noise and cost.
Infrastructure (15-25%)
Vector databases sitting idle, monitoring overhead, unnecessary load balancing.
What Actually Moved the Needle
Token-Aware Context (35% savings)
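Budget-based assembly that stops once the token budget is used up. Before: 12k tokens/query. After: 3.2k tokens/query. Same accuracy.
```python
def _build_context(self, results, settings):
    # Token budget for retrieved context; 2000 is only the fallback default.
    max_tokens = settings.get("max_context_tokens", 2000)
    current_tokens = 0
    selected = []
    # Results are assumed to be text chunks sorted best-first.
    for result in results:
        tokens = self.llm.count_tokens(result)
        if current_tokens + tokens > max_tokens:
            break  # budget exhausted: drop this and all lower-ranked chunks
        selected.append(result)
        current_tokens += tokens
    return "\n\n".join(selected)
```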
Hybrid Reranking (25% savings)
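A minimal sketch of the weighted blend this section describes, under a few assumptions of ours: each candidate already carries a semantic similarity normalized to [0, 1] (for example, cosine similarity from the vector store), and plain term overlap stands in for the keyword score. A production system might well use BM25 instead. `Candidate`, `keyword_score`, and `hybrid_rerank` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List

SEMANTIC_WEIGHT = 0.7  # from the post
KEYWORD_WEIGHT = 0.3   # from the post


@dataclass
class Candidate:
    text: str
    semantic_score: float  # assumed already normalized to [0, 1]


def keyword_score(query: str, text: str) -> float:
    """Fraction of distinct query terms found in the text (stand-in for BM25)."""
    query_terms = set(query.lower().split())
    if not query_terms:
        return 0.0
    text_terms = set(text.lower().split())
    return len(query_terms & text_terms) / len(query_terms)


def hybrid_rerank(query: str, candidates: List[Candidate], top_k: int = 8) -> List[Candidate]:
    """Blend semantic and keyword scores, then keep the top_k candidates."""
    scored = [
        (
            SEMANTIC_WEIGHT * c.semantic_score
            + KEYWORD_WEIGHT * keyword_score(query, c.text),
            c,
        )
        for c in candidates
    ]
    # Sort on the blended score only, highest first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]
```

The idea is to retrieve a wider candidate set (say, 20), rerank it with the blended score, and pass only the best 8 on to context assembly.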
70% semantic + 30% keyword scoring. Better ranking means fewer chunks needed. Top-20 → top-8 retrieval while maintaining quality.
Embedding Caching (20% savings)
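Workspace-isolated cache with a 7-day TTL. We see a 45-60% hit rate intra-day.
```python
import hashlib
import json

async def set_embedding(self, text, embedding, workspace_id=None):
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()  # stable across processes, unlike hash()
    key = f"embedding:ws_{workspace_id}:{digest}"
    await redis.setex(key, 604800, json.dumps(embedding))  # 604800 s = 7 days
```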
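The post only shows the write path. Here is a sketch of what the matching read path could look like, with a tiny in-memory store standing in for Redis so it runs without a server. `InMemoryStore`, `embedding_key`, `get_or_compute`, and `embed_fn` are hypothetical names, not part of any library. Note that the key uses a SHA-256 digest: Python's built-in `hash()` is randomized per process for strings, so the write path would need to build its key the same way for lookups to hit across workers and restarts.

```python
import asyncio
import hashlib
import json
import time

TTL_SECONDS = 7 * 24 * 60 * 60  # 7 days, matching the post


class InMemoryStore:
    """Minimal stand-in for an async Redis client (get and setex only)."""

    def __init__(self):
        self._data = {}

    async def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._data[key]  # expired entry behaves like a miss
            return None
        return value

    async def setex(self, key, ttl, value):
        self._data[key] = (value, time.monotonic() + ttl)


def embedding_key(text, workspace_id):
    """Workspace-scoped key built from a process-independent digest."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"embedding:ws_{workspace_id}:{digest}"


async def get_or_compute(store, text, workspace_id, embed_fn):
    """Return (embedding, was_cache_hit); embed_fn is awaited only on a miss."""
    key = embedding_key(text, workspace_id)
    cached = await store.get(key)
    if cached is not None:
        return json.loads(cached), True
    embedding = await embed_fn(text)
    await store.setex(key, TTL_SECONDS, json.dumps(embedding))
    return embedding, False


if __name__ == "__main__":
    async def demo():
        async def fake_embed(text):  # hypothetical embedding call
            return [0.1, 0.2, 0.3]

        store = InMemoryStore()
        print(await get_or_compute(store, "hello", "acme", fake_embed))
        print(await get_or_compute(store, "hello", "acme", fake_embed))

    asyncio.run(demo())
```

Because the workspace ID is part of the key, identical text in two workspaces is cached separately, which is what keeps the cache workspace-isolated.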
Batch Embedding (15% savings)
Batch API pricing is 30-40% cheaper per token. Process 50 texts simultaneously instead of individually.
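A rough sketch of the grouping step, assuming your provider exposes some call that takes a list of texts and returns one vector per input in the same order. `embed_batch` is a hypothetical callable standing in for that call, and `chunked` and `embed_all` are names we made up for illustration.

```python
from typing import Callable, Iterator, List, Sequence

BATCH_SIZE = 50  # from the post


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def embed_all(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = BATCH_SIZE,
) -> List[List[float]]:
    """Embed texts in batches, preserving input order in the output."""
    vectors: List[List[float]] = []
    for batch in chunked(texts, batch_size):
        result = embed_batch(batch)
        if len(result) != len(batch):
            raise ValueError("embed_batch returned a different number of vectors than inputs")
        vectors.extend(result)
    return vectors
```

With 120 texts this makes 3 calls (50, 50, 20) instead of 120, and the length check catches a provider response that silently drops an input.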