by jacquesm 2351 days ago
This is a good starting point but it ends just when things get interesting. If you are going to process audio for ML, make sure you experiment with normalizing the input volume; this can make a huge difference. And if your inputs are in stereo, try processing them as mono, single-channel, and stereo inputs to see which one performs better.
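To make that concrete, here is a rough sketch of the two experiments, assuming the audio is already decoded into lists of floats in [-1, 1]. The helper names (`peak_normalize`, `channel_variants`), the 0.9 target peak, and the choice of the left channel as the "single channel" are all assumptions for illustration, not something from the article.

```python
def peak_normalize(samples, target_peak=0.9):
    """Scale samples so the largest absolute value equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


def channel_variants(left, right):
    """Build the three input layouts to compare against each other."""
    mono = [(l + r) / 2.0 for l, r in zip(left, right)]  # simple average downmix
    return {
        "mono": mono,
        "single_channel": list(left),      # assumption: keep the left channel only
        "stereo": list(zip(left, right)),  # (left, right) pairs per frame
    }


# Tiny made-up stereo clip, just to exercise the helpers.
left = [0.1, -0.2, 0.05]
right = [0.3, 0.0, -0.05]

variants = channel_variants(left, right)
normalized_mono = peak_normalize(variants["mono"])
print(normalized_mono)
```

Peak normalization is only one option; RMS or loudness-based normalization may behave differently, so it is worth treating the normalization method itself as something to experiment with.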

Finally, if you pre-process the audio using an FFT, try different FFT sizes.

2 comments

It might be good to understand how changing the hop and window size affects the analysis so you're not blindly changing settings.

The trade-off for window size is between frequency resolution and time resolution. A bigger window gives you narrower bands, so more frequency resolution, while giving you less temporal resolution, which matters where an onset or transient is significant in the analysis. Similarly, hop size will determine how 'leaky' the process is and how fine-grained the windows will be. This can affect the detection of quick peaks or changes, possibly smearing them across a few windows.
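The arithmetic behind that trade-off is easy to check. A minimal sketch, assuming a 16 kHz sample rate and a hop of one quarter of the window (both picked only for illustration; `stft_resolution` is a made-up helper name):

```python
def stft_resolution(sample_rate, window_size, hop_size):
    """Return the basic time/frequency figures for one STFT configuration."""
    return {
        "bin_width_hz": sample_rate / window_size,        # spacing between FFT bins
        "window_ms": 1000.0 * window_size / sample_rate,  # audio covered by one window
        "hop_ms": 1000.0 * hop_size / sample_rate,        # time step between windows
        "overlap": 1.0 - hop_size / window_size,          # fraction shared with the next window
    }


sample_rate = 16000  # assumed sample rate in Hz
for window_size in (256, 1024, 4096):
    r = stft_resolution(sample_rate, window_size, window_size // 4)
    print(window_size, r["bin_width_hz"], r["window_ms"], r["hop_ms"])
```

At 16 kHz, going from a 256-sample to a 4096-sample window narrows the bins from 62.5 Hz to about 3.9 Hz, but each window then spans 256 ms instead of 16 ms, so a short transient gets averaged over a much longer stretch of audio. Note that bin spacing is only part of the story: the window function also affects how well nearby frequencies can be separated.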

Hmmm, that makes sense to me for offline / non-real-time analysis, but I'd be interested to know how that would affect things in real time. I guess some sort of DRC/limiter could be used to attempt to "normalize" the incoming audio.
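For what it's worth, a very rough sketch of that idea: a causal gain stage that tracks a decaying peak and scales each sample toward a target level, so it only ever uses past audio. This is a simplification and not a proper DRC or limiter design; the class name and the `target_peak`, `release`, and `max_gain` values are assumptions chosen for illustration.

```python
class StreamingNormalizer:
    """Causal gain control driven by a peak envelope with instant attack and slow release."""

    def __init__(self, target_peak=0.9, release=0.999, max_gain=10.0):
        self.target_peak = target_peak
        self.release = release    # per-sample decay factor of the tracked peak
        self.max_gain = max_gain  # cap so quiet noise is not boosted without limit
        self.envelope = 0.0       # state carried across blocks

    def process(self, block):
        out = []
        for s in block:
            # The envelope never drops below the current sample, so the
            # scaled output cannot exceed target_peak.
            self.envelope = max(abs(s), self.envelope * self.release)
            if self.envelope > 0.0:
                gain = min(self.target_peak / self.envelope, self.max_gain)
            else:
                gain = 1.0  # nothing but silence seen so far
            out.append(s * gain)
        return out


normalizer = StreamingNormalizer()
first = normalizer.process([0.01, 0.5, -1.5])  # blocks arrive one at a time
second = normalizer.process([0.2, 0.1])        # envelope state persists between calls
print(first, second)
```

Unlike offline normalization, the gain here changes over time, so the same sound can come out at different levels depending on what preceded it. That may or may not matter for a given model.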