
Great question! It makes things more interesting. Speech-to-speech models open up new attack angles. Prosody, the intonation patterns that convey meaning, emotion, and emphasis beyond the literal words, comes into play. We have observed that soft-spoken, gentle, and unsure requests often outperform authoritative statements in these systems. Speech-to-speech models also introduce additional attack surface: background noise, or phrases spoken as asides (like speaking to another person in the room), can affect the model's understanding. This documentation started from testing a speech-to-speech model. You bring up an excellent point, though. We will need to go back and reframe this documentation to highlight the differences between testing TTS and STS systems, with some pointers on how to detect which type of system you are interacting with. Thanks for the question!