by iverjo 3484 days ago
To the author: Have you tried using a logarithmic frequency scale in the spectrogram? [1] That representation is closer to the way humans perceive sound, and it gives you finer resolution at the lower frequencies. [2] If you want to bring your representation even closer to human perception, take a look at Google's CARFAC research. [3] Basically, they model the ear. I've prepared a Python utility for converting sound to a Neural Activity Pattern (which resembles a spectrogram when plotted) here: https://github.com/iver56/carfac/tree/master/util

[1] https://sourceforge.net/p/sox/feature-requests/176/

[2] https://en.wikipedia.org/wiki/Mel_scale

[3] http://research.google.com/pubs/pub37215.html
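
To make the "finer resolution in the lower frequencies" point concrete, here is a rough sketch of the mel mapping. It assumes the common 2595 * log10(1 + f/700) formula, which is only one of several mel variants; the function names are made up for illustration and do not come from SoX or CARFAC.

```python
import math

def hz_to_mel(f_hz):
    # Assumed variant: m = 2595 * log10(1 + f / 700)
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse of the mapping above
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spaced_frequencies(f_min, f_max, n):
    # n frequencies (n >= 2) evenly spaced on the mel axis, returned in Hz
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n - 1)
    return [mel_to_hz(m_min + i * step) for i in range(n)]

if __name__ == "__main__":
    # Hypothetical range and bin count, chosen only for the demo
    for f in mel_spaced_frequencies(20.0, 8000.0, 10):
        print(round(f, 1))
```

Bins that are evenly spaced in mel end up packed more tightly in Hz at the low end, which is where the extra low-frequency detail comes from.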

3 comments

Mel scale spectrograms are the approach taken in a research paper which uses roughly the same technique as is described in this post: https://dl.dropboxusercontent.com/u/19706734/paper_pt.pdf
I don't think this problem is bound by absolute frequency resolution. The tightest distance between two notes on a typical piano is ~2 Hz, and if you assume a doubling between octaves you're at <90 notes. The temporal changes and relative chord progressions probably give more info.
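
For anyone who wants to check the arithmetic above, a quick sketch. It assumes an 88-key piano in 12-tone equal temperament with A4 = 440 Hz as key 49; real tunings deviate slightly, and the function names are just for illustration.

```python
def piano_key_frequencies(a4_hz=440.0, n_keys=88):
    # Key 49 is A4; each semitone multiplies the frequency by 2 ** (1 / 12),
    # so the frequency doubles every 12 keys (one octave).
    return [a4_hz * 2.0 ** ((key - 49) / 12.0) for key in range(1, n_keys + 1)]

def smallest_gap_hz(freqs):
    # Smallest difference between neighbouring notes, in Hz
    return min(hi - lo for lo, hi in zip(freqs, freqs[1:]))

if __name__ == "__main__":
    freqs = piano_key_frequencies()
    print(len(freqs), "notes")
    print("lowest note:", round(freqs[0], 2), "Hz")
    print("smallest gap:", round(smallest_gap_hz(freqs), 2), "Hz")
```

Under these assumptions the tightest spacing is between the two lowest keys, and it comes out a bit under 2 Hz.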
Thanks for your insights! I agree that log/mel spectrograms could be even more detailed and effective, and they could be generated with the SoX patch discussed here: https://sourceforge.net/p/sox/feature-requests/176/
I didn't intend to go that far with the human genre recognition parallel, but thanks for the references! Good job on the script too.