| The music examples are utterly fascinating. It sounds insanely natural. The only thing I can hear that sounds unnatural is the way that the reverberation in the room (the "echo") immediately gets lower when the raw piano sound itself gets lower. In a real room, if you produce a loud sound and immediately after a soft sound, the reverberation of the loud sound remains. But since this network only models "the sound right now", the volume of the reverberation follows the volume of the piano sound. To my ears, this is most noticeable in the last example, which starts out loud and gradually becomes softer. It sounds a bit like they are cross-fading between multiple recordings. Regardless, the piano sounds completely natural to me; I don't hear any artifacts or sounds that a real piano wouldn't make. Amazing! There are also fragments that sound inspiring and very musical to my ears, such as the melody and chord progression after 00:08 in the first example. |
If you train NNs at the phrase level and overfit, then you get something that is indeed more or less the same as cross-fading at random between short sections.
Piano music is very idiomatic, so you'll capture some typical piano gestures that way.
But I'd be surprised if the music stays listenable for long. Classical music has big structures, and there's a difference between recognising letters (notes), recognising phrases (short sentences), recognising paragraphs (phrase structures), and parsing an entire piece (a novel or short story with characters and multiple plot lines).
Corpus methods don't work very well for non-trivial music, because there's surprisingly little consistency at the more complex levels.
NN synthesis could be an interesting thing though. If you trained an NN on sounds at various pitches and velocity levels, you might be able to squeeze a large and complex collection of samples into a compressed data set.
Even if the output isn't very realistic, you'd still get something unusual and interesting.
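To make the "compressed sample set" idea a bit more concrete, here's a toy sketch in plain Python. Everything in it is an assumption for illustration: the "samples" are synthetic decay envelopes rather than real piano audio, pitch and velocity are normalised to 0..1, and `make_samples` and `TinyNet` are made-up names. It only shows the shape of the idea: a small network is fitted to (pitch, velocity, time) -> amplitude, so 31 parameters stand in for 72 stored values.

```python
import math
import random

def make_samples():
    """Toy stand-in for a sample library: one decay envelope per (pitch, velocity)."""
    data = []
    for pitch in (0.0, 0.5, 1.0):             # normalised pitch (hypothetical scale)
        for velocity in (0.25, 0.5, 1.0):     # normalised key velocity
            for step in range(8):
                t = step / 7.0
                # Assumed behaviour: louder notes start higher, higher notes decay faster.
                amp = velocity * math.exp(-(2.0 + 3.0 * pitch) * t)
                data.append(((pitch, velocity, t), amp))
    return data

class TinyNet:
    """One tanh hidden layer, linear output, trained by full-batch gradient descent."""

    def __init__(self, hidden=6, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
        self.b2 = 0.0

    def n_params(self):
        return len(self.w1) * 3 + len(self.b1) + len(self.w2) + 1

    def forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        y = sum(w * hj for w, hj in zip(self.w2, h)) + self.b2
        return h, y

    def loss(self, data):
        # Half the mean squared error over the whole data set.
        return 0.5 * sum((self.forward(x)[1] - t) ** 2 for x, t in data) / len(data)

    def train_step(self, data, lr=0.05):
        n = len(data)
        hidden = len(self.w2)
        g_w1 = [[0.0] * 3 for _ in range(hidden)]
        g_b1 = [0.0] * hidden
        g_w2 = [0.0] * hidden
        g_b2 = 0.0
        # Accumulate gradients over all samples before changing any weight.
        for x, target in data:
            h, y = self.forward(x)
            err = (y - target) / n
            g_b2 += err
            for j in range(hidden):
                g_w2[j] += err * h[j]
                back = err * self.w2[j] * (1.0 - h[j] ** 2)
                g_b1[j] += back
                for i in range(3):
                    g_w1[j][i] += back * x[i]
        for j in range(hidden):
            self.w2[j] -= lr * g_w2[j]
            self.b1[j] -= lr * g_b1[j]
            for i in range(3):
                self.w1[j][i] -= lr * g_w1[j][i]
        self.b2 -= lr * g_b2

data = make_samples()
net = TinyNet()
initial_loss = net.loss(data)
for _ in range(1000):
    net.train_step(data)
final_loss = net.loss(data)
print(len(data), "stored values vs", net.n_params(), "network parameters")
print("loss before/after training:", initial_loss, final_loss)
```

Real audio would need a far bigger model and a much better input representation than this, and whether the result sounds realistic is a separate question. But the network can also be queried at pitches and velocities that were never sampled, which is where the unusual in-between sounds would come from.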