TorToiSe collapses in this way quite often, especially on unseen tokens that wouldn't have appeared in the training data, which likely consisted largely of transcribed speech and read text like audiobooks. Trying to make it laugh by prompting it with "hahaha" (which you won't really see in that kind of data) often gets you demon and zombie noises.
It uses the TorToiSe TTS model for generation. It's simple to generate conditioning voice latents from a few short audio samples. Transcribed JRE episodes were likely part of the TorToiSe training data, which would explain why it's so good at recreating his voice characteristics in particular.
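For anyone curious what the latent step looks like, here's a rough sketch rather than the exact code used for that generation. It assumes the open-source tortoise-tts package and its TextToSpeech.get_conditioning_latents, load_audio, and tts_with_preset calls as I remember them, so double-check against the version you have installed. The find_voice_clips helper and the voices/my_speaker directory are made up for illustration.

```python
from pathlib import Path


def find_voice_clips(voice_dir):
    """Return the .wav clips in voice_dir, sorted by path (hypothetical helper)."""
    return sorted(p for p in Path(voice_dir).iterdir() if p.suffix.lower() == ".wav")


def compute_voice_latents(clip_paths):
    """Turn a few short reference clips into TorToiSe conditioning latents."""
    # Imported lazily so find_voice_clips works without tortoise-tts installed.
    # Assumption: these module paths match the installed tortoise-tts release.
    from tortoise.api import TextToSpeech
    from tortoise.utils.audio import load_audio

    tts = TextToSpeech()  # fetches model weights on first use
    # Reference clips are loaded as 22.05 kHz waveforms.
    samples = [load_audio(str(path), 22050) for path in clip_paths]
    latents = tts.get_conditioning_latents(samples)
    return tts, latents


if __name__ == "__main__":
    clips = find_voice_clips("voices/my_speaker")  # hypothetical directory
    tts, latents = compute_voice_latents(clips)
    # Reuse the latents instead of passing the raw clips on every call.
    audio = tts.tts_with_preset(
        "Hello there.", conditioning_latents=latents, preset="fast"
    )
```

The point of computing the latents once is that you can cache them and reuse them across generations, instead of re-encoding the reference clips every time.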
That generation uses tortoise-tts. Play.ht has a model called Peregrine, and I've taken to using a script to call out to them. Super cool company & API. I just haven't had time to get my next-gen version out.