Very cool stuff! It seems that all those solutions are based on the analysis of visual representations of spectrograms. Is this common or could you just use 2d arrays which encode the same information - would this be more performant?
I wrote up some of my experiments attempting to do what you are describing. I explain why you can't simply use a 2D array of an audio file. You can find my post here:
I am by no means an expert in this area and a few people have since told me I did a few stupid things in my analysis. But you might find it interesting.
In this context, what's the difference between 'visual representations of spectrograms' and '2d arrays which encode the same information'? Algorithms don't have eyes. The way they 'see' is by reading '2d arrays'.
You mean 2d arrays containing the raw audio signal? No, this would not work because you do not know the phase along the y dimension when you want to compare to another signal.
Another method to detect an audio pattern is cross correlation on the raw audio signal. But it is very expensive in computation power and memory.
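For illustration, here is a minimal sketch of that cross-correlation approach on raw samples. It assumes NumPy, and the function name `find_offset` and the synthetic noise signal are made up for the example. The direct sliding product costs roughly len(signal) * len(pattern) multiplications, which is where the expense comes from.

```python
import numpy as np

def find_offset(signal, pattern):
    """Return the sample offset where `pattern` best aligns with `signal`.

    Hypothetical helper for illustration only: it slides `pattern` over
    `signal` and keeps the offset with the largest raw correlation score.
    """
    # mode="valid" only evaluates offsets where the pattern fits entirely.
    corr = np.correlate(signal, pattern, mode="valid")
    return int(np.argmax(corr))

# Synthetic stand-in for audio: white noise with a known excerpt cut out of it.
rng = np.random.default_rng(0)
signal = rng.standard_normal(5000)
pattern = signal[1234:1734].copy()

offset = find_offset(signal, pattern)
```

On real recordings the score would normally be normalized by local energy, and an FFT-based correlation would be used for long signals; both are left out here to keep the sketch short.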
The slowest step in fingerprinting is often the associated DB query. There is a lot of work to do there. In that space, Will Drevo's work is really good. I will share my DB implementation later.
There are specialized fingerprinting algorithms that tolerate sound modifications such as pitch shifting (https://biblio.ugent.be/publication/5754913), but they will not work with humming or live audio. I don't know if such a thing exists.
As for the 2d array spectrogram, it is not needed in my lib (except when plotting is activated). I only care about maxima in the spectrum of each data window. In other words, 1d spectra are enough.
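A rough sketch of the "1d spectra are enough" idea, assuming NumPy. This is not the library's actual code: the function name `strongest_bin_per_window`, the Hann taper, the non-overlapping windows, and the choice to keep only the single strongest bin are all simplifications made for the example. Each window is transformed, reduced to one peak, and then discarded, so no 2d array is ever stored.

```python
import numpy as np

def strongest_bin_per_window(samples, sample_rate, window_size=1024):
    """Return (time_in_seconds, peak_frequency_in_hz) for each window.

    Hypothetical helper: only one 1d spectrum exists at any moment.
    """
    taper = np.hanning(window_size)
    freqs = np.fft.rfftfreq(window_size, d=1.0 / sample_rate)
    peaks = []
    # Non-overlapping windows keep the sketch simple (an assumption).
    for start in range(0, len(samples) - window_size + 1, window_size):
        frame = samples[start:start + window_size] * taper
        spectrum = np.abs(np.fft.rfft(frame))  # 1d magnitude spectrum
        peaks.append((start / sample_rate, float(freqs[np.argmax(spectrum)])))
    return peaks

# Synthetic test tone: 1000 Hz sampled at 8000 Hz.
sample_rate = 8000
t = np.arange(4096) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)

peaks = strongest_bin_per_window(tone, sample_rate)
```

A real fingerprinter would typically keep several local maxima per window and hash pairs of them, but the memory argument is the same.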
Spectrograms are a convenient way to visualize the data/algorithm but are rarely part of the actual analysis.
They are already using the 2d array so to speak.
In any case, a spectrogram is just a 2d array where the magnitude of each array element is mapped to a color, so it's effectively the same thing.
Few if any people use visual representations of sound for analysis, except for the crazies who run spectrograms through visual deep learning networks.
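To make the "same thing" point concrete, here is a small sketch assuming NumPy. The helper names `spectrogram` and `to_grayscale`, the window size, and the decibel scaling are choices made for the example. The picture is the magnitude array rescaled and quantized to 8 bits, so it has the same shape but less precision than the array it came from.

```python
import numpy as np

def spectrogram(samples, window_size=256):
    """Magnitude spectrogram as a 2d array of shape (frequency, time)."""
    n_frames = len(samples) // window_size
    frames = samples[:n_frames * window_size].reshape(n_frames, window_size)
    return np.abs(np.fft.rfft(frames * np.hanning(window_size), axis=1)).T

def to_grayscale(magnitudes):
    """Map magnitudes to 0..255 pixel values on a decibel scale."""
    db = 20 * np.log10(magnitudes + 1e-12)  # small offset avoids log(0)
    scaled = (db - db.min()) / (db.max() - db.min())
    return np.round(scaled * 255).astype(np.uint8)

# Synthetic stand-in for audio.
rng = np.random.default_rng(1)
samples = rng.standard_normal(2048)

spec = spectrogram(samples)   # float array used by the analysis
image = to_grayscale(spec)    # what a plot would display
```

Working on `spec` directly skips the quantization step, which is one reason analysis code usually stays with the float array.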
Uh, are you sure of what you are writing here? Time-frequency analysis (including spectrograms) is one of the very fundamental tools for signal processing.
True, I was thinking of a spectrogram as purely a visualization of a time series of DFTs, but Matlab and other tools do not make this distinction.
I was mainly responding to the OP's distinction between analyzing a visual representation and analyzing a "2d array" when they are basically the same thing.
> analyzing a visual representation and analyzing a "2d array" when they are basically the same thing.
This is what I mean. I guess their tooling just outputs graphics and it's easier to work with those than the pure 2d array in numpy or something similar.
http://jack.minardi.org/software/computational-synesthesia/
You can also see the code behind it here:
https://github.com/jminardi/audio_fingerprinting