Fascinating! How did you decouple the speaker-specific vocal characteristics (timbre, pitch range) from the accent-defining phonetic and prosodic features in the latent space?
We didn't explicitly. Because we finetuned this model for accent classification, the later transformer layers appear to ignore non-accent vocal characteristics. I verified this for gender for example.