Show HN: Clawphone – Twilio voice/SMS gateway for AI agents using TwiML polling
(github.com)
2 points
by ranacseruet
121 days ago
clawphone bridges Twilio phone calls and SMS to an OpenClaw AI agent using plain TwiML webhooks — no WebSocket server, no external STT/TTS APIs (OpenAI, ElevenLabs, etc.), no audio encoding pipeline.
The design intentionally trades latency for operational simplicity. When a call comes in, Twilio handles speech-to-text via <Gather input="speech">, the agent runs async, and the reply is polled and spoken back via <Say>. It adds a couple seconds of round-trip vs. a Media Streams approach, but you only need one Twilio account and one Node process.
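To make the polling loop concrete, here is a minimal sketch of the three TwiML responses involved. It is not clawphone's actual code: the route paths (/listen, /speech, /poll), the function names, and the in-memory `pendingReplies` map are all assumptions made for illustration. Only the TwiML verbs (<Gather>, <Say>, <Pause>, <Redirect>) and the `CallSid` / `SpeechResult` webhook parameters come from Twilio.

```javascript
// Hypothetical sketch of a TwiML polling loop. Route paths, function names,
// and the in-memory store are illustrative assumptions, not clawphone's code.

const pendingReplies = new Map(); // CallSid -> reply text, set once the agent finishes

// TwiML is XML, so agent text must be escaped before it goes inside <Say>.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

function twiml(body) {
  return `<?xml version="1.0" encoding="UTF-8"?><Response>${body}</Response>`;
}

// Step 1 (/listen): Twilio transcribes the caller and POSTs the text to /speech.
function listenTwiml() {
  return twiml('<Gather input="speech" action="/speech" method="POST" speechTimeout="auto"/>');
}

// Step 2 (/speech): start the agent without awaiting it, then send the call to /poll.
// `runAgent` is a placeholder for whatever async function produces the reply.
function handleSpeech(callSid, speechResult, runAgent) {
  runAgent(speechResult)
    .then((reply) => pendingReplies.set(callSid, reply))
    .catch(() => pendingReplies.set(callSid, "Sorry, something went wrong."));
  return twiml('<Redirect method="POST">/poll</Redirect>');
}

// Step 3 (/poll): if the reply is not ready, pause one second and poll again;
// otherwise speak it and go back to listening.
function handlePoll(callSid) {
  if (!pendingReplies.has(callSid)) {
    return twiml('<Pause length="1"/><Redirect method="POST">/poll</Redirect>');
  }
  const reply = pendingReplies.get(callSid);
  pendingReplies.delete(callSid);
  return twiml(`<Say>${escapeXml(reply)}</Say><Redirect method="POST">/listen</Redirect>`);
}
```

Each poll is an ordinary HTTP request from Twilio, which is why no WebSocket server is needed. The cost is that the caller waits at least one pause interval after the agent finishes.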
Features:
- Standalone server (Node / PM2) or OpenClaw plugin mode
- SMS support with fast-path (sync) and async fallback
- Twilio webhook signature validation
- Per-number rate limiting
- Graceful shutdown with in-flight voice call drain
- Structured JSON logging + optional Discord channel logging
- 166 tests using Node's built-in node:test (no external framework)
It's zero-dependency at the HTTP layer: raw node:http, ES Modules only.
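As an illustration of what per-number rate limiting can look like, here is a small sliding-window sketch keyed by the caller's phone number. The function name, option names, and limits are hypothetical and chosen for the example; the project's actual limiter may work differently.

```javascript
// Hypothetical sliding-window limiter keyed by phone number. Names and limits
// are illustrative assumptions, not clawphone's actual configuration.
function createRateLimiter({ maxRequests, windowMs }) {
  const hits = new Map(); // phone number -> timestamps (ms) of recent accepted requests

  // Returns true if the request is allowed, false if the number is over its limit.
  return function allow(phoneNumber, now = Date.now()) {
    // Keep only timestamps that still fall inside the window.
    const recent = (hits.get(phoneNumber) || []).filter((t) => now - t < windowMs);
    if (recent.length >= maxRequests) {
      hits.set(phoneNumber, recent);
      return false;
    }
    recent.push(now);
    hits.set(phoneNumber, recent);
    return true;
  };
}
```

A webhook handler would call `allow(from)` with the caller's number before doing any agent work, and reject the request when it returns false.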
I built this because the official OpenClaw voice plugin requires a WebSocket gateway + external TTS/STT accounts. For a personal assistant or low-traffic deployment, that's a lot of infrastructure. This is the minimal path.
GitHub: https://github.com/ranacseruet/clawphone
npm: @ranacseruet/clawphone
Happy to answer questions about the architecture or the TwiML polling approach.

A couple questions / thoughts from building voice agents in production:
- How do you handle barge‑in / interruptions? With <Gather input="speech"> + polling, it’s hard to do true full‑duplex + partial ASR. Have you considered a hybrid mode where you keep the TwiML simplicity for setup, but optionally switch to <Stream> (Media Streams) when people want sub‑second turn-taking?
- Twilio’s built-in speech recog is convenient, but in my experience it can be the first thing teams outgrow (accuracy, language coverage, costs, and lack of token-level partials). Do you expose an interface so people can swap STT later without reworking the call control?
- For long agent responses: do you chunk <Say> / keep call alive with <Pause>? Any gotchas around Twilio timeouts while the agent is “thinking”?
We’ve run into the same infra-vs-latency tradeoff at eboo.ai (real-time voice agents / telephony + WebRTC). If you ever want a sanity check on the lowest-latency Twilio path (Media Streams + incremental STT + barge-in), happy to compare notes.