| The specific feature vector statement doesn’t hold for audio (at least). The time dimension adds complexity to the problem as the optimal values for the perturbation vary depending on both the immediately surrounding values, and many of the values beforehand. When I say “hello world”, the fact I said “e” depends on the fact I said “h”. “L” depends on both “e” and “h”... etc etc. Adds an extra dimension to the problem. Also, distance metrics for images aren’t ideal for audio, for many reasons. That’s why audio signal processing is a different sub field vs image processing. The approaches are similar, but we have to use different things in the end because audio behaves differently to images. Eg feature extraction through MFCC is a variant of Fourier, but specifically tailored for the human ear. E.g. Lea Schonherr et al.’s really good Psychoacoustic attack paper. On the negation of attacks through transforms - important to remember that an ensemble of weak defences are not strong. Many attacks have been shown to be robust to simple transformations. |