Hacker News new | ask | show | jobs
by pdxww 2613 days ago
The same way we map words to vectors or entire pictures to vectors. We'll have another ML model that would take 1 second of sound as input (48000 1 byte numbers) and produce a say vector of 128 float32 numbers that would "describe" this 1 second of sound.
1 comments

What would be an equivalent of a word for music?
1 second of sound. Or a few seconds of sound.
This would rule out such common mapping methods as word2vec, because unlike words, vast majority of 1 sec chunks of audio would be unique (or only repeating within a single recording).
That's fine. The goal is to map "similar" 1 second chunks to similar vectors. I'm sure this can be done and uniqueness of sound won't be a problem.
Sure, we can probably find a way to map two similar chunks to two similar vectors. However, with 1:1 mapping the resulting vectors will be just as unique. That's a problem, because, if you recall, we want to predict the next unit of music based on the units the model has seen so far. Training a model for this task requires showing it sequences of encoded units of music (vectors), where we must have many examples of how a particular vector follows a combination of particular vectors. If most of our vectors are unique, we won't have enough examples to train the model. For example, showing the model multiple examples of a phrase "I'm going to [some verb]", it will eventually learn that "to" after "I'm going" is quite likely, that a verb is more likely after "to" than an adjective, etc. This wouldn't have happened if the model saw 'going' or 'to' only once during training.
Can we diff spectrograms to define the "distance" between two chunks of sound and use this measure to guide the ML learning process?

Would it help to decompose sound into subpatterns with Fourier transform?

Afaik, there is a similar technique for recognizing faces: a face picture is mapped to a "face vector". Yet this technique doesn't need the notion of "sequence of faces" to train the model. Can we use it to get "sound vectors"?