In general, for realtime voice AI, you don't want this model to support multiple speakers, because each participant in a session already has a separate voice input stream.
We're not doing "speaker diarization" from a single audio track here. We're streaming the input from each participant.
If there are multiple participants in a session, we still process each stream separately, either as it comes in from that user's microphone (locally) or as it arrives over the network (server-side).
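
To make the per-participant idea concrete, here is a minimal sketch of routing audio by participant rather than separating speakers inside one mixed track. The names `ParticipantPipeline`, `SessionRouter`, and `on_audio` are hypothetical and not taken from any particular framework, and the "pipeline" only buffers frames where a real system would run voice activity detection and transcription.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ParticipantPipeline:
    """Hypothetical per-participant pipeline: holds one speaker's audio only."""

    participant_id: str
    frames: list[bytes] = field(default_factory=list)

    def push(self, frame: bytes) -> None:
        # Placeholder: a real pipeline would feed VAD / transcription here.
        self.frames.append(frame)

    def buffered_bytes(self) -> int:
        return sum(len(frame) for frame in self.frames)


class SessionRouter:
    """Sends each incoming frame to the pipeline of the participant who produced it."""

    def __init__(self) -> None:
        self._pipelines: dict[str, ParticipantPipeline] = {}

    def on_audio(self, participant_id: str, frame: bytes) -> None:
        # Create a pipeline the first time we hear from a participant.
        pipeline = self._pipelines.get(participant_id)
        if pipeline is None:
            pipeline = ParticipantPipeline(participant_id)
            self._pipelines[participant_id] = pipeline
        pipeline.push(frame)

    def pipeline_for(self, participant_id: str) -> ParticipantPipeline:
        return self._pipelines[participant_id]


# Usage: two participants, frames interleaved in arrival order.
router = SessionRouter()
router.on_audio("alice", b"\x00" * 320)
router.on_audio("bob", b"\x00" * 160)
router.on_audio("alice", b"\x00" * 320)
```

Because the participant identity arrives with the stream, no frame ever needs to be attributed to a speaker after the fact. The same routing applies whether `on_audio` is called from a local microphone callback or from a server-side network handler.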