by draismaa 105 days ago

We've had so many success stories with the LangWatch MCP server, an MCP integration that brings agent evaluation infrastructure directly into Claude Code, Cursor, and any MCP-compatible environment, that I had to share some of them here.

The problem it's solving: teams building AI agents spend their whole day inside their coding assistant, but evaluation still requires logging into a separate platform, learning a new UI, and context-switching. The MCP server closes that gap. What you can do from within your editor:

Ask your AI assistant to instrument your existing code with LangWatch tracing (it fetches the docs, adds imports, wraps functions with @langwatch.trace())

Generate simulation-based agent tests using Scenario: describe the behavior in plain English and it writes the pytest file

Search and inspect live traces from your project without touching the dashboard

Version and sync prompts to LangWatch's registry

Query cost/latency analytics in natural language

Set up LLM-as-a-judge evaluators that can gate CI/CD
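
For anyone who hasn't wired a judge-based gate into CI before, the idea boils down to roughly the following. This is a minimal sketch and not LangWatch's API: `Case`, `keyword_judge`, `pass_rate`, and `exit_code` are hypothetical names, the 0.8 threshold is arbitrary, and the judge is a keyword check standing in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Case:
    """One evaluated interaction: what was asked and what the agent said."""
    prompt: str
    answer: str


def keyword_judge(case: Case) -> float:
    # Hypothetical stand-in for an LLM judge: scores 1.0 if the answer
    # mentions a refund, else 0.0. A real judge would call a model here.
    return 1.0 if "refund" in case.answer.lower() else 0.0


def pass_rate(cases: Sequence[Case], judge: Callable[[Case], float]) -> float:
    # Mean judge score across all cases; an empty suite counts as 0.0
    # so that a misconfigured run cannot pass silently.
    if not cases:
        return 0.0
    return sum(judge(c) for c in cases) / len(cases)


def exit_code(
    cases: Sequence[Case],
    judge: Callable[[Case], float],
    threshold: float = 0.8,  # assumed value, tune per project
) -> int:
    # 0 means the pipeline may proceed, 1 means the gate blocks it.
    return 0 if pass_rate(cases, judge) >= threshold else 1


# In a CI step you would end with something like:
#   raise SystemExit(exit_code(cases, keyword_judge))
```

The only part CI cares about is the process exit status, so the judge can be swapped for any scorer that returns a number between 0 and 1 without touching the gate itself.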

Three real-world cases from the blog post:

A PM at an HR/payroll platform generated 63 agent test scenarios across 11 categories (happy paths, edge cases, wage tax mutations) in a single Claude conversation — no code written by hand.

A Senior AI Engineer migrated an entire Langfuse implementation to LangWatch in one session: Claude read the existing integration, rewired tracing, converted Jinja prompts to versioned YAML, and scaffolded model benchmarking notebooks comparing GPT-4o, Gemini, and Anthropic models.

A Dutch government AI team (LangGraph, multi-agent grant assessment system) used the MCP to build a full testing pyramid: end-to-end scenario tests, model comparison notebooks, and CI-gated quality evaluators before they'd written a single line of eval code themselves.

Setup is one line:

  claude mcp add langwatch -- npx -y @langwatch/mcp-server --apiKey your-key

Docs: https://langwatch.ai/docs/integration/mcp

Curious if others are building MCP-powered eval workflows. The self-instrumenting agent angle (agents setting up their own observability while being built) is something we've been exploring and it gets weird fast.