Hacker News
by woodson 749 days ago
Replacing the codebook approach with a statistical/DNN model is more likely to improve accuracy than getting rid of MFCCs as the spectral representation (at least in general ASR). (Arguably, using Mel spectra was the least controversial design choice made for Whisper.)

thank you! those are good points. i was thinking that maybe you could get by with some relatively sparse convolutional layers over the raw sound samples and save yourself the expense of doing a real fourier transform, but maybe that's a dumb idea
It is a good idea that is worth trying out! Like anything, there are tradeoffs though, so it is not guaranteed to be better in this particular case. The ability to use low bit-depth integer operations (which is easy for a neural net) should be beneficial on a CPU without a floating-point unit. But the weights need to be stored, and it can be difficult to match FFT efficiency, depending on what resolution is actually needed/utilized.
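
To put rough numbers on that tradeoff, here is a minimal Python sketch, not anything from an actual implementation. It shows an integer-only strided 1-D convolution over raw samples, plus a back-of-envelope cost comparison against an FFT. All function names are hypothetical, and the example figures (a 400-sample window, 40 filters, a 512-point FFT, 8-bit weights) are assumptions chosen only for illustration. The FFT count uses the textbook radix-2 estimate of (N/2)·log2(N) complex multiplies at 4 real multiplies each, and it ignores the mel filterbank and log steps that would follow.

```python
def strided_conv1d_int(samples, kernel, stride):
    """Valid-mode 1-D correlation using only integer multiply-accumulates.

    `samples` and `kernel` are sequences of ints (e.g. int16 audio and
    int8 weights); no floating-point operation is involved.
    """
    k = len(kernel)
    out = []
    for start in range(0, len(samples) - k + 1, stride):
        acc = 0
        for j in range(k):
            acc += samples[start + j] * kernel[j]
        out.append(acc)
    return out


def fft_real_mults(n_fft):
    """Rough real-multiply count for one radix-2 complex FFT of size n_fft."""
    if n_fft < 2 or n_fft & (n_fft - 1):
        raise ValueError("n_fft must be a power of two")
    log2_n = n_fft.bit_length() - 1
    return 4 * (n_fft // 2) * log2_n


def conv_frontend_macs(kernel_len, n_filters):
    """Multiply-accumulates per output frame for a dense conv filterbank."""
    return kernel_len * n_filters


def conv_weight_bytes(kernel_len, n_filters, bits_per_weight=8):
    """Storage needed for the conv weights (the FFT needs only twiddle factors)."""
    return kernel_len * n_filters * bits_per_weight // 8


if __name__ == "__main__":
    # Assumed example sizes, not taken from the thread.
    print("FFT real multiplies per frame:", fft_real_mults(512))
    print("Conv MACs per frame:", conv_frontend_macs(400, 40))
    print("Conv weight bytes (int8):", conv_weight_bytes(400, 40))
```

With these assumed sizes the dense conv front end needs 16000 MACs per frame and 16000 bytes of weights, against roughly 9216 real multiplies for the FFT. A sparse kernel or fewer filters would shift the balance, which is the "depends on what resolution is actually needed" part.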