| Hey HN! I built SemanticTest while working on calendar0.app (an AI calendar assistant). While building the assistant, I noticed a lack of good AI eval frameworks for testing my agent. SemanticTest uses GPT-4 as a judge to evaluate text responses (semantic meaning), tool calls (correct tools, right order), and multi-turn conversations. It's composable: you build tests as JSON pipelines using custom blocks. Would love feedback.
Thank you! |
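
For a rough idea of what a "correct tools, right order" check involves, here is a minimal Python sketch. It is illustrative only and is not SemanticTest's actual block format or API; the function name `tools_called_in_order` and the tool names are hypothetical, and it uses a plain deterministic comparison rather than an LLM judge.

```python
from typing import Sequence


def tools_called_in_order(actual: Sequence[str], expected: Sequence[str]) -> bool:
    """Return True if every expected tool appears in `actual` in the same relative order.

    Extra tool calls in between are allowed (subsequence match).
    """
    remaining = iter(actual)
    # `name in remaining` consumes the iterator up to the match,
    # so each expected tool must appear after the previous one.
    return all(name in remaining for name in expected)


# Hypothetical tool names for a calendar assistant.
calls = ["search_events", "check_availability", "create_event"]
print(tools_called_in_order(calls, ["search_events", "create_event"]))  # True
print(tools_called_in_order(calls, ["create_event", "search_events"]))  # False
```

A subsequence match is one possible design choice: it tolerates extra intermediate calls while still failing when the agent calls tools in the wrong order. A stricter test could compare the two lists for exact equality instead.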