I wonder how it would come across with the right voice. We're focused on building out the video layer tech, but at the end of the day, the voice is also pretty important for a positive experience.
I wonder if we should make the UI a more common interface (e.g. "the call is ringing") to avoid this confusion?
It's a normal mp4 video that's looping initially (the "welcome message") and then as soon as you send the bot a message, we connect you to a GPU and the call becomes interactive. Connecting to the GPU takes about 10s.
Makes sense. The init should be about 10s. But, after that, it should be real time. TBH, this is probably a common confusion. So thanks for calling it out.