by bravura 668 days ago
If this is supposed to be used for deep-learning, shouldn't all the transforms be GPU-accelerated torch functions?

By the looks of it, those functions extract features (like frequency peaks). You do that once for a sound. The output could function as input for an NN, in which case it would be a tokenizer for sound.
Given what I've seen in audio ML research:

1) Tuning the hyperparameters of your audio preprocessing is a pain if it's a precomputed CPU step. You have to redo the preprocessing every time you want to tune your audio feature hyperparams.

2) It's quite common to use torchaudio spectrograms, etc. purely because they are faster (I can link to a handful of recent high-impact audio ML github repos if you like)

3) If you use nnAudio, you can actually backprop the STFT or mel filters and tune them if you like. With that said, this is not so commonplace.

4) Sometimes the audio is GENERATED by a GPU. For example, in a neural vocoder, you decode the audio from a mel to a waveform. Then, you compute the loss over the true versus predicted audio mel spectrograms. You can't do this with these C++ features. (Again, I can link a handful of recent high-impact audio ML github repos if you like.)

Again, I just don't get it.

>Again, I just don't get it.

The point is, ship it.

Seriously, nobody is lugging a GPU around to interact with their most frequently used micro-computing platform, their headphones, which right now, already represent a new and extraordinary era of "accelerated component" market expansion.

The 7 microphones in your earpiece, and the 6 speakers pushing air into your head, are perhaps not quite as close to the GPU as they need to be .. but they already have a DSP, and there is already a silicon battle going on among the vendors.

>You can't do this with these C++ features.

Yes, and I think the point in the end, is to use AI to write better C++ code, and design better, cheaper, smarter silicon, as always (and actually ship it) ..

> I can link to a handful of recent high-impact audio ML github repos if you like

Yes please :D

For instance:

https://github.com/descriptinc/descript-audio-codec/blob/mai...

https://github.com/NVIDIA/BigVGAN/blob/main/loss.py#L23

https://arxiv.org/pdf/2210.13438 (the github repo doesn't include training, just inference)

It is INCREDIBLY common to use multi-scale spectral loss as the audio distance / objective measure in audio generation. These losses have some issues (e.g. they aren't always well correlated with human perception), but they are the best currently known.
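
To make "multi-scale spectral loss" concrete, here is a minimal sketch of one common variant: the mean L1 distance between log-magnitude spectrograms, averaged over several FFT sizes. It is not the exact loss from the repos linked above (those typically work on mel spectrograms and are written in torch so gradients flow back to the generator). The function names, the FFT sizes and the hop of n_fft // 4 are all assumptions chosen for illustration, and the naive DFT is only there to keep the example dependency-free.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def stft_mag(signal, n_fft, hop):
    """Magnitude STFT via a naive DFT (no padding).

    Returns one list of n_fft // 2 + 1 magnitudes per frame.
    """
    window = hann(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(n_fft)]
        bins = []
        for k in range(n_fft // 2 + 1):
            acc = sum(
                frame[i] * cmath.exp(-2j * math.pi * k * i / n_fft)
                for i in range(n_fft)
            )
            bins.append(abs(acc))
        frames.append(bins)
    return frames


def multi_scale_spectral_loss(pred, target, fft_sizes=(16, 32, 64), eps=1e-7):
    """Mean L1 distance between log-magnitude spectrograms, averaged over FFT sizes.

    Assumes pred and target have equal length, at least max(fft_sizes) samples.
    """
    total = 0.0
    for n_fft in fft_sizes:
        hop = n_fft // 4  # illustrative choice of hop size
        p = stft_mag(pred, n_fft, hop)
        t = stft_mag(target, n_fft, hop)
        diffs = [
            abs(math.log(a + eps) - math.log(b + eps))
            for frame_p, frame_t in zip(p, t)
            for a, b in zip(frame_p, frame_t)
        ]
        total += sum(diffs) / len(diffs)
    return total / len(fft_sizes)
```

Comparing a signal with itself gives exactly 0.0, and any spectral difference gives a positive value. In a real training loop the same arithmetic would run on the generated waveform on the GPU, which is exactly why a CPU-only feature extractor without backprop doesn't fit here.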

Backpropping filter coefficients sounds clever, but can't you just do that on any layer that takes a spectrum as input?
Backpropping filter coefficients is clever, but it hasn't really caught on much. Google also tried with LEAF (https://github.com/google-research/leaf-audio) to have a learnable audio filterbank.

Anyway, in audio ML what is very common is:

a) Futzing with the way you do feature extraction on the input. (Oh, maybe I want CQT for this task or a different scale Mel etc)

b) Doing feature extraction on generated audio output, and constructing loss functions from generated audio features.
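
As a small illustration of (a), here is a sketch of how the centre frequencies of a mel filterbank fall out of three hyperparameters. It uses the widely used HTK-style mel formula; the function names and the parameter names n_mels, f_min and f_max are my own choices for the example, not taken from any particular library.

```python
import math


def hz_to_mel(hz):
    """HTK-style mel scale: 2595 * log10(1 + hz / 700)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_mels, f_min, f_max):
    """Centre frequencies (Hz) of n_mels bands spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]
```

Changing n_mels, f_min or f_max moves every band, so features that were precomputed to disk with the old values have to be regenerated before you can try the new ones.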

So, as I said, I don't exactly see the utility of this library for deep learning.

With that said, it is definitely nice to have really high-speed, low-latency audio algorithms in C++. I just wouldn't market it as "useful for deep learning" because:

a) during training, you need more flexibility than non-GPU methods without backprop

b) if you are doing "deep learning" then your inferred model will presumably be quite large, and there will be a million other things you'll need to optimize to get real-time inference or inference on CPUs to work well.

This is just my gut reaction. It seems like a solid project; I just question the one selling point of "useful for deep learning", that's all.

Are there resources you would recommend reading regarding ML and audio?
This is a really broad topic. I began studying it about 5 years ago.

Can you start by suggesting what task you want to do? I'll throw out some suggestions, but you can say something different. Also you are welcome to email me (email in HN profile):

* Voice conversion / singing voice conversion

* Transcription of audio to MIDI

* Classification / tagging of audio scene

* Applying some effect / cleanup to audio

* Separating audio into different instruments

etc

The really quick summary of audio ML as a topic is:

* Often people treat audio ML as vision ML, by using spectrogram representations of audio. Nonetheless, 1D models are sometimes just as good if not better, but they require very specific familiarity with the audio domain.

* Audio distance measures (loss functions) are pretty crappy and not well-correlated with human perception. You can say the same thing about vision distance measures, but a lot more research has gone into vision models so we have better heuristics around vision stuff. With that said, multi-scale log mel spectrogram isn't that terrible.

* Audio has a handful of little gotchas around padding, windowing, etc.

* DSP is a black art and DSP knowledge has high ROI versus just being dumb and black boxy about everything.
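
One example of the padding gotcha from the list above: the number of STFT frames you get for the same clip depends on whether the signal is padded so that frames are centred on their timestamps. The sketch below assumes the convention of padding n_fft // 2 samples on each side, which is what the `center=True` default in torch.stft and librosa does; the helper name num_frames is made up for this example.

```python
def num_frames(n_samples, n_fft, hop, center):
    """Number of STFT frames for a signal of n_samples samples.

    center=True assumes n_fft // 2 samples of padding on each side.
    """
    if center:
        n_samples += 2 * (n_fft // 2)
    if n_samples < n_fft:
        return 0  # not even one full frame fits
    return 1 + (n_samples - n_fft) // hop
```

For one second of 16 kHz audio with n_fft=1024 and hop=256, this gives 63 frames with centring and 59 without. Mixing features computed under the two conventions gives shape mismatches or silently misaligned targets.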

A GPU is useful, but DSPs are also still useful - for example there is a compelling case to have frameworks around such as AudioFlux, JUCE and others, in order to support portability and also realtime analysis competitively, which is important in this domain, where such things as Qualcomm's ADK, and others, are quite literally being put inside people's ears...

Not to say that big-AI shouldn't have audio analysis as a compelling sphere of application, but more that, until the chips arrive, in-ear AI is less of a specification/requirement, than in-ear DSP.

We don't need AI to isolate discrete audio components and do things with them, in-ear. Offline/big-AI, however, is still compelling. But we don't yet have GPU neckbands ..

Maybe for the convenience of mobile usage?