| These models aren't actually doing any musical comparison -- they are trained on audio, and from that audio they piece apart what a "note", a "melody" and an "instrument" are, using the labelled training data. No intentional theory is being done! Algorithmic music composition has usually been split into two: 1. Generate notes (re: theory, genre) 2. Generate sound (e.g., EMI[0], Kulitta[1], MusicNet[2]) Now we are doing both at the same time, and backwards. The model isn't (necessarily) going "write melody, then generate the sound", but rather, "here are 500 songs that are described with X, 500 with Y, and you want XY, so we'll combine these two" :) (This is my best understanding, so feel free to correct) [0]: http://artsites.ucsc.edu/faculty/cope/experiments.htm [1]: https://hackage.haskell.org/package/Kulitta [2]: https://zenodo.org/record/5120004 |
The problem is that coherent musical structures are much more constrained. You can't just XYZ... into a space and get something that makes sense.
That will kind of work for low-density music, which includes a lot of landfill dance + subgenres. But these statistical models are blind to larger and more complex structures, and completely unaware of cultural context and semantics.
It's actually a harder problem than language modelling because the spaces and the grammars are much larger, especially once you start including sound quality and production values as well as arrangement and core composition.
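To make the "more constrained" point concrete, here's a toy sketch in Python. It is not how any of these audio models work; it just assumes melodies are lists of MIDI note numbers and that "makes sense" means "stays in C major". The names (`C_MAJOR`, `in_c_major`, `blend`) are made up for illustration. Two melodies that each satisfy the constraint can have a midpoint that doesn't:

```python
# Pitch classes (note number mod 12) that belong to the C major scale.
C_MAJOR = {0, 2, 4, 5, 7, 9, 11}

def in_c_major(melody):
    """True if every MIDI note in the melody is a C major scale tone."""
    return all(note % 12 in C_MAJOR for note in melody)

def blend(a, b):
    """Naive 'combine X and Y': integer midpoint of each pair of notes."""
    return [(x + y) // 2 for x, y in zip(a, b)]

melody_a = [60, 64, 67, 72]  # C E G C
melody_b = [62, 65, 69, 74]  # D F A D

mixed = blend(melody_a, melody_b)

print(in_c_major(melody_a), in_c_major(melody_b))  # both inputs are in key
print(mixed, in_c_major(mixed))                    # the blend is not
```

The valid melodies are scattered points in the space, not a region you can freely interpolate inside, and that's with one trivial rule. Real harmony, form, and arrangement pile on many more.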