This is a really broad topic; I started studying it about 5 years ago. Can you start by saying what task you want to do? Also, you're welcome to email me (email in HN profile). I'll throw out some suggestions, but you can say something different:

* Voice conversion / singing voice conversion
* Transcription of audio to MIDI
* Classification / tagging of audio scenes
* Applying some effect / cleanup to audio
* Separating audio into different instruments, etc.

The really quick summary of audio ML as a topic is:

* People often treat audio ML as vision ML by using spectrogram representations of audio. Nonetheless, 1D models are sometimes just as good if not better, but they require very specific familiarity with the audio domain.
* Audio distance measures (loss functions) are pretty crappy and not well correlated with human perception. You can say the same thing about vision distance measures, but a lot more research has gone into vision models, so we have better heuristics around vision stuff. With that said, multi-scale log mel spectrogram isn't that terrible.
* Audio has a handful of little gotchas around padding, windowing, etc.
* DSP is a black art, and DSP knowledge has high ROI versus just being dumb and black-boxy about everything.
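To make the padding / windowing gotcha concrete, here is a rough stdlib-only sketch of STFT-style framing. The helper names (`hann`, `reflect_pad`, `frame`, `window_coverage`) and the sizes (1000 samples, 256-sample window, hop of 64) are made up for illustration and are not from any particular library. It assumes a periodic Hann window and reflect padding of half a window on each side, which is one common convention but not the only one. The point is that without centering, the very first sample gets zero window weight and the tail that doesn't fill a whole frame is silently dropped.

```python
import math

def hann(n_fft):
    # Periodic Hann window: zero at index 0, peak of 1.0 at n_fft // 2.
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n_fft) for i in range(n_fft)]

def reflect_pad(x, pad):
    # Mirror the signal around its endpoints without repeating the edge sample.
    # Assumes pad < len(x).
    left = x[1:pad + 1][::-1]
    right = x[-pad - 1:-1][::-1]
    return left + x + right

def frame(x, n_fft, hop, center):
    # Slice x into overlapping frames; any tail shorter than a frame is dropped.
    if center:
        x = reflect_pad(x, n_fft // 2)
    n_frames = 1 + (len(x) - n_fft) // hop
    return [x[i * hop:i * hop + n_fft] for i in range(n_frames)]

def window_coverage(n_samples, n_fft, hop, center):
    # Total Hann weight that each ORIGINAL sample receives across all frames.
    pad = n_fft // 2 if center else 0
    n_frames = 1 + (n_samples + 2 * pad - n_fft) // hop
    w = hann(n_fft)
    cover = [0.0] * (n_samples + 2 * pad)
    for f in range(n_frames):
        for i in range(n_fft):
            cover[f * hop + i] += w[i]
    return cover[pad:pad + n_samples]

# Hypothetical toy signal and sizes, chosen only for the demo.
n_samples, n_fft, hop = 1000, 256, 64
signal = [math.sin(0.05 * t) for t in range(n_samples)]

for center in (False, True):
    frames = frame(signal, n_fft, hop, center)
    cover = window_coverage(n_samples, n_fft, hop, center)
    print(f"center={center}: {len(frames)} frames, "
          f"weight on first sample = {cover[0]:.2f}")
```

With these sizes the uncentered version produces 12 frames and the centered one 16, so two pipelines that disagree on this one flag will produce spectrograms of different lengths and misaligned frames. That is the kind of thing that quietly breaks a loss function or a frame-level label alignment.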