Show HN: Rehearse – a pytest like testing library for voice agents
1 points by djp2803 139 days ago
I was manually calling my Twilio voice agent 100 times a day to verify every single micro change.

Tired of that, I built Rehearse.

I know there is a lot of YC money going into voice testing companies, but I wanted to build something open source and code-first, so that Claude Code can spin up and manage test cases.

Example usage:

- call.listen() -> get audio or transcript of what the agent is saying

- call.say("I'd like to book a table for 2 at midnight") -> speak with the agent

- assertions on responses
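
To make the shape of a test concrete, here is a minimal sketch that runs without placing a call. `ScriptedCall` is a hypothetical stand-in I made up for illustration, not Rehearse's real class; only the `listen()` / `say()` names come from the list above, and making them async is an assumption.

```python
import asyncio


class ScriptedCall:
    """Hypothetical stand-in for a live call: replays canned agent lines."""

    def __init__(self, agent_lines):
        self._agent_lines = list(agent_lines)
        self.said = []  # everything the test "spoke" to the agent

    async def listen(self):
        # Return the agent's next utterance as a transcript string.
        return self._agent_lines.pop(0)

    async def say(self, text):
        # Record what the caller said; a real call would synthesize speech.
        self.said.append(text)


async def booking_scenario(call):
    # Listen to the greeting, speak once, then capture the agent's reply.
    greeting = await call.listen()
    await call.say("I'd like to book a table for 2 at midnight")
    reply = await call.listen()
    return greeting, reply


def test_booking_flow():
    call = ScriptedCall([
        "Thanks for calling. How can I help?",
        "Sorry, we close at 11 pm. Would 10 pm work?",
    ])
    greeting, reply = asyncio.run(booking_scenario(call))
    # Basic text assertions on the transcripts.
    assert "help" in greeting.lower()
    assert "close" in reply.lower()
    assert call.said == ["I'd like to book a table for 2 at midnight"]
```

With a real call object in place of the stub, the scenario function would stay the same and only the setup would change.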

For now it only supports Twilio (my use case) for calls and ElevenLabs for transcription, with basic text and LLM-based assertions.

It makes real calls and is BYOK.

I have a bunch of ideas in mind (not implemented yet, not sure if useful):

1. simulations like accents, background noise, languages, network issues, interruptions, etc

2. voice agent testing another voice agent

3. native audio based assertions

4. more connection options like Vapi, Retell, Websockets etc

GitHub https://github.com/thenullterminator/rehearse

PyPI https://pypi.org/project/rehearse/

Everything is a bit janky right now.

Appreciate all your feedback!

2 comments

This is neat. A couple test cases that have bitten us on real voice agent deployments (beyond noise/accents):

- Barge-in / interruption: user starts talking mid-agent-sentence, agent should stop + recover state.

- DTMF flows + mixed-mode ("press 1", then spoken intent). Also: false DTMF (ASR hears "one" as tone).

- Silence / dead air / voicemail: detect long silence, prompt once, then gracefully end; detect voicemail greeting.

- Transfers: warm vs cold transfer, verifying you actually bridged the call + preserving context.

- Telephony weirdness: jitter/packet loss, codec changes (PCMU vs OPUS), partial transcripts, delayed ASR.

- Guardrails: PII capture + confirmation, profanity de-escalation, "agent must not comply" tests.

One UX thought: record/replay (store the raw audio + timing) so regressions are deterministic and you can run “golden” call fixtures in CI without placing a real call every time.
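
A rough sketch of what such a fixture could look like, assuming each turn is stored as a transcript plus start/end offsets and a path to the raw audio. The `Turn` fields, the JSON layout, and the helper names are all made up for illustration; nothing here is an existing Rehearse feature.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class Turn:
    speaker: str      # "agent" or "caller"
    text: str         # transcript of the turn
    start_s: float    # offset from call start, in seconds
    end_s: float      # offset at which the turn finished
    audio_path: str   # where the raw audio for this turn was stored


def save_fixture(path, turns):
    # Persist a recorded call as a JSON "golden" fixture.
    Path(path).write_text(json.dumps([asdict(t) for t in turns], indent=2))


def load_fixture(path):
    # Load the fixture back so CI can assert on it without a live call.
    return [Turn(**d) for d in json.loads(Path(path).read_text())]


def response_gaps(turns):
    """Seconds between the end of each caller turn and the agent's reply."""
    gaps = []
    for prev, cur in zip(turns, turns[1:]):
        if prev.speaker == "caller" and cur.speaker == "agent":
            gaps.append(round(cur.start_s - prev.end_s, 3))
    return gaps


recorded = [
    Turn("agent", "Thanks for calling. How can I help?", 0.0, 2.1, "turn0.wav"),
    Turn("caller", "I'd like to book a table for 2 at midnight", 2.6, 5.0, "turn1.wav"),
    Turn("agent", "Sorry, we close at 11 pm.", 6.2, 8.0, "turn2.wav"),
]

fixture = Path(tempfile.mkdtemp()) / "booking_call.json"
save_fixture(fixture, recorded)
replayed = load_fixture(fixture)

assert replayed == recorded                # round trip is lossless
assert max(response_gaps(replayed)) < 2.5  # same latency check, no real call
```

Keeping the timing next to the transcript is what lets latency and barge-in regressions show up in replay, not just wording changes.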

(We build production voice agents at eboo.ai; happy to share a small bundle of “gotcha” scenarios if useful.)

Thanks for sharing this, Pranay! How are you testing your agent deployments today? Is it vibe testing or automated?

Ideal UX:

    @pytest.mark.asyncio
    async def test_agent_handles_profanity():

        async with VAPICall(
            phone_number="+15551234567",
            api_key="your-vapi-key",
            background_noise=BackgroundNoise.TRAFFIC,
            noise_level=0.4,
            speaking_style=SpeakingStyle(
                accent="american",
                speed=1.4,  # Speaking fast when angry
            ),
        ) as call:

            await call.listen()  # Greeting
    
            await call.say(
                "This is bullshit, I want to speak to a manager!",
                emotion="angry",
            )
            response = await call.listen()
    
            # Agent should remain professional and de-escalate
            await expect(response).to_satisfy(
                "remains calm and professional",
                "does not mirror the profanity",
                "offers to escalate or resolve the issue",
                llm=judge
            )
            expect(response.audio).to_not_have_emotion("angry")
            expect(response.latency).to_be_less_than(2.5)