Try StyleTTS2. You will still have to experiment with the settings a little to get the right level of adherence to the reference speaker’s voice and emotional content.
I haven't looked at it yet, but are you sure it can do speech to speech? Maybe the flaw in my searching has been dismissing anything called "text to speech" on the assumption that it can't also do "speech to speech"?