Hi HN,

We’re open-sourcing the Go orchestrator we built at Lokutor (https://github.com/lokutor-ai/lokutor-orchestrator).

Building a voice agent that feels like a human is 20% model quality and 80% orchestration. The "standard" approach (daisy-chaining STT, LLM, and TTS APIs) usually results in a 2-3 second delay that kills the conversation. We also found that implementing "Barge-in" (the ability to interrupt the bot) is surprisingly tricky to get right across multiple streaming providers.

We chose Go because voice orchestration is essentially a high-concurrency plumbing problem. You’re managing several bidirectional streams (WebSockets/gRPC) while calculating RMS for VAD (Voice Activity Detection) and managing a state machine that needs to respond in milliseconds when it detects user speech.

What’s inside:

Full-Duplex: Capture and playback occur simultaneously without audio feedback loops.
Native Barge-in: When the user starts speaking, the orchestrator immediately kills the LLM generation and clears the TTS audio buffers.
Built-in RMS VAD: Thread-safe voice activity detection out of the box.
Provider Agnostic: Swap between Groq, OpenAI, Deepgram, Anthropic, and our own Versa engine.
Minimal Latency: Designed to add <10ms of overhead on top of the provider latencies.

We've used this to build agents with sub-500ms end-to-end response times. We would love to hear your feedback on the architecture, especially on how we handle the ManagedStream state machine.

GitHub: https://github.com/lokutor-ai/lokutor-orchestrator
Docs: https://pkg.go.dev/github.com/lokutor-ai/lokutor-orchestrato...