I work on real-time voice/video AI at Tavus, and for the past few years I’ve mostly focused on how machines respond in a conversation. One thing that’s always bothered me is that almost all conversational systems still reduce everything to transcripts and throw away a ton of signals that need to be used downstream. Some existing emotion-understanding models try to classify what they see into a small set of arbitrary boxes, but they aren’t fast enough or rich enough to do this with conviction in real time.

So I built a multimodal perception system that encodes visual and audio conversational signals and translates them into natural language by aligning a small LLM on those signals. The agent can "see" and "hear" you, and you can interface with it via an OpenAI-compatible tool schema in a live conversation.

It outputs short natural-language descriptions of what’s going on in the interaction: things like uncertainty building, sarcasm, disengagement, or even a shift in attention within a single turn of a convo.

Some quick specs:

- Runs in real time, per conversation
- Processes ~15 fps video plus overlapping audio alongside the conversation
- Handles nuanced emotions, whispers vs. shouts
- Trained on synthetic + internal convo data

Happy to answer questions or go deeper on architecture/tradeoffs.

More details here: https://www.tavus.io/post/raven-1-bringing-emotional-intelli...

Another concern I’d have is bias. If I am prone to speaking loudly, is it going to say I’m shrill? If my camera is not aligned well, is it going to say I’m not making eye contact?