Strange that they are feeding raw audio in. Even in humans, there is a hardware transform to the frequency domain (the cochlea) before data is fed to the brain. Effectively doing this part in the LLM seems inefficient.
The FFT is just a fast way of computing the DFT, which is essentially a matrix multiplication, or two. No need for fancy conversions, just a huge amount of training data and a very large array.
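To make the "matrix multiplication, or two" point concrete, here is a minimal sketch in plain Python. The helper names (`dft_matrices`, `matvec`, `dft_magnitudes`) and the 16-sample test signal are made up for illustration, and reading "two" as one cosine matrix plus one sine matrix is my assumption. This is the naive DFT, not an FFT, and says nothing about how any particular model actually handles audio.

```python
import math


def dft_matrices(n):
    """Build the two real n x n matrices of the DFT (hypothetical helper).

    cos_m gives the real part and sin_m the imaginary part, following
    X[k] = sum_t x[t] * (cos(2*pi*k*t/n) - i*sin(2*pi*k*t/n)).
    """
    cos_m = [[math.cos(2 * math.pi * k * t / n) for t in range(n)]
             for k in range(n)]
    sin_m = [[-math.sin(2 * math.pi * k * t / n) for t in range(n)]
             for k in range(n)]
    return cos_m, sin_m


def matvec(m, x):
    """Plain matrix-vector product: one dot product per row."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def dft_magnitudes(x):
    """Magnitude spectrum of x via two real matrix multiplications."""
    cos_m, sin_m = dft_matrices(len(x))
    re = matvec(cos_m, x)
    im = matvec(sin_m, x)
    return [math.hypot(r, i) for r, i in zip(re, im)]


# Illustrative input: a cosine with exactly 3 cycles in a 16-sample window.
n = 16
signal = [math.cos(2 * math.pi * 3 * t / n) for t in range(n)]
mags = dft_magnitudes(signal)

# Energy lands in bin 3 and its mirror bin 13; every other bin is ~0.
peaks = [k for k, m in enumerate(mags) if m > 1.0]
print(peaks)
```

The two matrices are fixed and linear, so in principle a learned linear layer could represent them. The catch is cost: the matrix form needs on the order of N^2 multiplies per window, while an FFT needs on the order of N log N.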