Hacker News
by p1esk 2614 days ago
How would you map these chunks to vectors?

The same way we map words or entire pictures to vectors. We'd have another ML model that takes 1 second of sound as input (48,000 one-byte samples) and produces a vector of, say, 128 float32 numbers that "describes" that second of sound.
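A minimal sketch of that pipeline, under loud assumptions: the samples are unsigned 8-bit values at 48 kHz, and `encode_chunk` is a hypothetical hand-written stand-in (per-frame RMS energy) for the trained model, which the comment does not specify. It only illustrates the shapes involved: 48,000 samples in, 128 numbers out, with similarity measured between the resulting vectors.

```python
import math
import random

SAMPLE_RATE = 48000   # samples per second, as in the comment above
EMBED_DIM = 128       # size of the output vector

def split_into_chunks(samples, chunk_len=SAMPLE_RATE):
    """Cut a recording into consecutive 1-second chunks, dropping any remainder."""
    n = len(samples) // chunk_len
    return [samples[i * chunk_len:(i + 1) * chunk_len] for i in range(n)]

def encode_chunk(chunk, dim=EMBED_DIM):
    """Hypothetical stand-in for a trained encoder: RMS energy of `dim` equal frames."""
    frame_len = len(chunk) // dim
    vec = []
    for i in range(dim):
        frame = chunk[i * frame_len:(i + 1) * frame_len]
        # Assumes unsigned 8-bit samples; centre them around zero first.
        energy = sum((s - 128) ** 2 for s in frame) / frame_len
        vec.append(math.sqrt(energy))
    return vec

def cosine_similarity(a, b):
    """Similarity between two chunk vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

if __name__ == "__main__":
    rng = random.Random(0)
    # Two seconds of synthetic noise standing in for a real recording.
    audio = [rng.randrange(256) for _ in range(2 * SAMPLE_RATE)]
    first, second = (encode_chunk(c) for c in split_into_chunks(audio))
    print(len(first), cosine_similarity(first, second))
```

A learned encoder would replace `encode_chunk`; the chunking and the similarity check would stay the same.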
What would be an equivalent of a word for music?
1 second of sound. Or a few seconds of sound.
This would rule out common mapping methods such as word2vec because, unlike words, the vast majority of 1-second chunks of audio would be unique (or would repeat only within a single recording).
That's fine. The goal is to map "similar" 1-second chunks to similar vectors. I'm sure this can be done, and the uniqueness of the sounds won't be a problem.
Sure, we can probably find a way to map two similar chunks to two similar vectors. However, with a 1:1 mapping the resulting vectors will be just as unique as the chunks. That's a problem because, if you recall, we want to predict the next unit of music based on the units the model has seen so far. Training a model for this task requires showing it sequences of encoded units of music (vectors) that contain many examples of how a particular vector follows a particular combination of vectors. If most of our vectors are unique, we won't have enough examples to train the model. For example, after seeing many examples of the phrase "I'm going to [some verb]", a model will eventually learn that "to" is quite likely after "I'm going", that a verb is more likely than an adjective after "to", and so on. This wouldn't happen if the model saw "going" or "to" only once during training.
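To make the sparsity argument concrete, here is a rough sketch that counts how often each unit is followed by each other unit. The toy sentences and the `chunk_r_t` identifiers are made up for illustration; the latter stand in for audio chunks whose 1:1 encodings never repeat.

```python
from collections import Counter, defaultdict

def next_unit_counts(sequences):
    """For every unit, count which units follow it across all sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def max_examples_per_context(counts):
    """Largest number of training examples available for any single context."""
    return max(sum(followers.values()) for followers in counts.values())

# Words repeat, so each context is seen several times.
text = [
    "i'm going to sleep".split(),
    "i'm going to eat".split(),
    "i'm going to run".split(),
]
word_counts = next_unit_counts(text)

# Hypothetical unique identifiers: no chunk occurs twice, within or across recordings.
audio = [[f"chunk_{r}_{t}" for t in range(4)] for r in range(3)]
chunk_counts = next_unit_counts(audio)

if __name__ == "__main__":
    print(word_counts["going"]["to"])               # "to" followed "going" in all 3 sentences
    print(max_examples_per_context(word_counts))    # several examples per context
    print(max_examples_per_context(chunk_counts))   # every context seen exactly once
```

With words, "going" is followed by "to" three times; with unique chunks, no transition is ever observed more than once, so there is nothing to generalize from.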