Hey HN,

We built Composo because AI apps fail unpredictably and teams have no idea if their changes helped. LLM-as-judge doesn't work: it gives random scores, doesn't work well for agents, and doesn't tell you what to fix.

We've developed purpose-built evaluation models that give you:
- Deterministic scores (same input = same score, always)
- Instant identification of where prompts, retrievals, agents & tool calls fail
- Exact failure analysis ("tool calls are looping due to poorly specified schema")

We're 92% accurate vs 72% for SOTA LLM-as-judge.

Giving 10 startups free access:
- 10k eval credits
- Access to our just-launched evals API for agents & tool calling
- 5 min setup

Already helping teams at Palantir, Accenture, and Tesla ship reliable AI.

Apply: composo.short.gy/startups

Happy to answer questions about evaluation, reward models, or why LLMs are bad at judging themselves.

startups@composo.ai