| HN Mirror



	by pdxww 2613 days ago
	The same way we map words or entire pictures to vectors. We'd have another ML model that takes 1 second of sound as input (48,000 one-byte samples) and produces, say, a vector of 128 float32 numbers that "describes" this 1 second of sound.
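To make the shapes in this proposal concrete, here is a minimal sketch of the input/output contract only. The `encode` function is a hand-written placeholder (mean absolute amplitude in 128 time bins), not a trained model, and the names `make_chunk`, `encode`, `SAMPLE_RATE`, and `EMBED_DIM` are hypothetical.

```python
import math
from array import array

SAMPLE_RATE = 48_000  # one-byte samples per second, as in the comment above
EMBED_DIM = 128       # length of the output vector, as in the comment above


def make_chunk(freq_hz: float) -> bytes:
    """Build one second of an unsigned 8-bit sine tone (hypothetical test input)."""
    return bytes(
        int(127.5 + 127.5 * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE))
        for n in range(SAMPLE_RATE)
    )


def encode(chunk: bytes) -> array:
    """Placeholder encoder: 48,000 bytes in, 128 float32 values out.

    Assumption: a learned model would replace this body. Here each output
    value is just the mean absolute amplitude of one time bin, scaled to [0, 1].
    """
    if len(chunk) != SAMPLE_RATE:
        raise ValueError("expected exactly one second of audio")
    bin_size = SAMPLE_RATE // EMBED_DIM  # 375 samples per bin
    vec = array("f")  # 'f' stores single-precision (float32) values
    for i in range(EMBED_DIM):
        window = chunk[i * bin_size:(i + 1) * bin_size]
        vec.append(sum(abs(s - 128) for s in window) / (bin_size * 128))
    return vec


chunk = make_chunk(440.0)
vector = encode(chunk)
input_bytes = len(chunk)                      # size of the raw second of audio
output_bytes = len(vector) * vector.itemsize  # size of the embedding
```

The only point here is the compression ratio: 48,000 bytes of audio become a 512-byte vector. Whether such a vector can capture "similarity" depends entirely on the model that replaces the placeholder.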

1 comments

p1esk 2613 days ago

What would be an equivalent of a word for music?


pdxww 2612 days ago

1 second of sound. Or a few seconds of sound.


p1esk 2612 days ago

This would rule out common mapping methods such as word2vec, because unlike words, the vast majority of 1-second chunks of audio would be unique (or would repeat only within a single recording).


pdxww 2612 days ago

That's fine. The goal is to map "similar" 1-second chunks to similar vectors. I'm sure this can be done, and the uniqueness of each chunk won't be a problem.


p1esk 2612 days ago

Sure, we can probably find a way to map two similar chunks to two similar vectors. However, with a 1:1 mapping the resulting vectors will be just as unique. That's a problem because, if you recall, we want to predict the next unit of music based on the units the model has seen so far. Training a model for this task requires showing it sequences of encoded units of music (vectors), and we need many examples of how a particular vector follows a particular combination of vectors. If most of our vectors are unique, we won't have enough examples to train the model. For example, if we show the model multiple examples of the phrase "I'm going to [some verb]", it will eventually learn that "to" is quite likely after "I'm going", that a verb is more likely than an adjective after "to", etc. This wouldn't happen if the model saw "going" or "to" only once during training.
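The sparsity argument can be illustrated with a toy count-based next-unit table. This is a deliberate simplification using exact-match tokens, not a neural model; `next_unit_counts` and the `chunk_*` labels are hypothetical names made up for the illustration.

```python
from collections import Counter, defaultdict


def next_unit_counts(sequences):
    """For every unit, count which units were observed directly after it."""
    table = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            table[prev][nxt] += 1
    return table


# Words repeat across sentences, so evidence accumulates.
text = [
    "i'm going to run".split(),
    "i'm going to sleep".split(),
    "i'm going to eat".split(),
]
word_table = next_unit_counts(text)
evidence_for_going = sum(word_table["going"].values())

# Audio units that never repeat: every chunk gets its own unique label.
audio = [[f"chunk_{rec}_{i}" for i in range(4)] for rec in range(3)]
audio_table = next_unit_counts(audio)
max_audio_evidence = max(sum(c.values()) for c in audio_table.values())
```

With words, "going" has been followed by "to" three times; with unique chunks, no unit has more than a single observed successor, so there is nothing to generalize from at the exact-match level.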


pdxww 2611 days ago

Can we diff spectrograms to define the "distance" between two chunks of sound and use this measure to guide the ML learning process?

Would it help to decompose sound into subpatterns with Fourier transform?
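As a rough sketch of what "diffing spectrograms" could mean, the code below computes a magnitude spectrogram with a naive short-time Fourier transform and takes the Euclidean distance between two of them. The frame size, hop, 8 kHz sample rate, and the function names are all assumptions chosen to keep the example small; a real pipeline would use an FFT library and probably a perceptual frequency scale.

```python
import cmath
import math


def magnitude_spectrogram(samples, frame_size=64, hop=32):
    """Naive STFT: one list of DFT magnitudes (bins 0..frame_size//2) per frame."""
    # A Hann window tapers each frame to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        mags = []
        for k in range(frame_size // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            mags.append(abs(acc))
        frames.append(mags)
    return frames


def spectrogram_distance(a, b):
    """Euclidean distance between the magnitude spectrograms of two signals."""
    sa, sb = magnitude_spectrogram(a), magnitude_spectrogram(b)
    if len(sa) != len(sb):
        raise ValueError("signals must have the same length")
    return math.sqrt(sum((x - y) ** 2
                         for fa, fb in zip(sa, sb)
                         for x, y in zip(fa, fb)))


def tone(freq_hz, n_samples=1024, rate=8000):
    """Pure sine tone used as a hypothetical test signal."""
    return [math.sin(2 * math.pi * freq_hz * n / rate) for n in range(n_samples)]


d_same = spectrogram_distance(tone(440), tone(440))
d_near = spectrogram_distance(tone(440), tone(450))
d_far = spectrogram_distance(tone(440), tone(2000))
```

Identical signals give a distance of zero, and a 440 Hz tone lands closer to a 450 Hz tone than to a 2000 Hz one. That makes this a usable similarity signal for pure tones; whether it tracks musical similarity is a separate question.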

Afaik, there is a similar technique for recognizing faces: a picture of a face is mapped to a "face vector". Yet this technique doesn't need the notion of a "sequence of faces" to train the model. Can we use it to get "sound vectors"?
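Face-embedding models are commonly trained with a triplet loss, which needs only "similar" and "dissimilar" labels and no sequences. Whether that is the technique meant here is an assumption. A minimal sketch of one common formulation of the loss, with hypothetical 2-D vectors standing in for sound embeddings:

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor = [0.0, 0.0]

# Well-separated case: the "similar" chunk is near, the "dissimilar" one is far.
loss_good = triplet_loss(anchor, positive=[0.1, 0.0], negative=[1.0, 0.0])

# Violating case: the "similar" chunk is farther away than the "dissimilar" one.
loss_bad = triplet_loss(anchor, positive=[1.0, 0.0], negative=[0.1, 0.0])
```

The first case yields zero loss and the second a positive loss of about 1.1, which is what would push the encoder to pull similar chunks together. The open problem for audio is where the similar/dissimilar labels come from.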
