by augustl 3577 days ago
The music examples are utterly fascinating. It sounds insanely natural.

The only thing I can hear that sounds unnatural is the way the reverberation in the room (the "echo") immediately gets quieter when the raw piano sound itself gets quieter. In a real room, if you produce a loud sound and a soft sound immediately afterwards, the reverberation of the loud sound remains. But since this network only models "the sound right now", the volume of the reverberation follows the volume of the piano sound.

To my ears, this is most noticeable in the last example, which starts out loud and gradually becomes softer. It sounds a bit like they are cross-fading between multiple recordings.
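
A toy sketch of the difference described above. The room response (`room`, `DECAY`) is made up, and `tracking_reverb` is a hypothetical helper that stands in for "the reverb level follows the current note". It only illustrates the acoustics; it says nothing about how the network works internally.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two lists of floats."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        if s == 0.0:
            continue
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def tracking_reverb(signal, impulse_response):
    """Toy model (an assumption): the tail restarts at every new note,
    so the reverb level always follows the most recent note."""
    out = [0.0] * len(signal)
    level, onset = 0.0, 0
    for n, s in enumerate(signal):
        if s != 0.0:
            level, onset = s, n
        k = n - onset
        out[n] = level * impulse_response[k] if k < len(impulse_response) else 0.0
    return out


# Hypothetical room: direct sound plus an exponentially decaying tail.
DECAY = 0.9
room = [DECAY ** n for n in range(40)]

# A loud note at sample 0, then a soft note at sample 10.
dry = [0.0] * 20
dry[0] = 1.0
dry[10] = 0.1

wet_room = convolve(dry, room)             # real room: tails add up
wet_tracking = tracking_reverb(dry, room)  # tail follows the current note

# At the soft note's onset the real room still carries the loud note's tail.
print(wet_room[10], wet_tracking[10])
```

In the real-room version the level at sample 10 is the soft note plus what is left of the loud note's tail, while the tracking version drops straight to the soft note's level, which is the abrupt change I think I am hearing.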

Regardless, the piano sounds completely natural to me; I don't hear any artifacts or sounds that a real piano wouldn't make. Amazing!

There are also fragments that sound inspiring and very musical to my ears, such as the melody and chord progression after 00:08 in the first example.


I can hear some distortion in the piano notes - which may be an audio compression artefact, or it may be the output of the resynthesis process.

If you train NNs at the phrase level and overfit, then you get something that is indeed more or less the same as cross-fading at random between short sections.

Piano music is very idiomatic, so you'll capture some typical piano gestures that way.

But I'd be surprised if the music stays listenable for long. Classical music has big structures, and there's a difference between recognising letters (notes), recognising phrases (short sentences), recognising paragraphs (phrase structures), and parsing an entire piece (a novel or short story with characters and multiple plot lines.)

Corpus methods don't work very well for non-trivial music, because there's surprisingly little consistency at the more complex levels.

NN synthesis could be an interesting thing though. If you trained an NN on sounds at various pitches and velocity levels, you might be able to squeeze a large and complex collection of samples into a compressed data set.

Even if the output isn't very realistic, you'd still get something unusual and interesting.

The samples are uncompressed WAV files, so everything you hear is a direct result of the synthesis process. Some of the distortion is a result of the 16 kHz sample rate; it's not 44.1 kHz CD quality.
It's quantized to just 256 values though, which could be causing some of the distortion.
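
For reference, a sketch of what 256-level quantization looks like with mu-law companding, which is the scheme the WaveNet paper describes for its 8-bit output as far as I can tell. The function names and the rounding convention are my own assumptions, not taken from any released code.

```python
import math

MU = 255  # 256 quantization levels, numbered 0..255


def mu_law_encode(x):
    """Map a sample in [-1, 1] to an integer level in 0..255."""
    x = max(-1.0, min(1.0, x))
    # Compress: small amplitudes get more levels than large ones.
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    # Shift [-1, 1] to [0, 255] and round to the nearest level.
    return int((y + 1.0) / 2.0 * MU + 0.5)


def mu_law_decode(q):
    """Map an integer level in 0..255 back to a sample in [-1, 1]."""
    y = 2.0 * q / MU - 1.0
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


# Round-trip error for a quiet and a loud sample.
for x in (0.001, 0.9):
    err = abs(mu_law_decode(mu_law_encode(x)) - x)
    print(x, mu_law_encode(x), err)
```

The round-trip error is tiny for quiet samples and much larger near full scale, so any quantization noise from this scheme would show up mostly on loud passages.
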
It shot me forward to a time where people just click a button to generate music they want to listen to. If you really like the generation, you save it and share it. It wouldn't have all of the other aspects that we derive from human-produced music like soul/emotion (because we know it's coming from a human, not because of how it sounds), but it would be a cool application idea anyway.
Have you tried https://www.jukedeck.com ? AI composed music at the touch of a button.
This reminds me of the Library of Babel short story.
I agree, the samples sound very natural. I ask myself though how similar they are to the data that has been used for training, as it would be trivial to rearrange individual pieces of a large training set in ways that sound good (especially if a human selects the good samples for presentation afterwards).

What I'd really like to see therefore is a systematic comparison of the generated music to the training set, ideally using a measure of similarity.

A nice property of the model is that it is easy to compute exact log-likelihoods for both training data and unseen data, so one can actually measure the degree of overfitting (which is not true for many other types of generative models). Another nice property of the model is that it seems to be extremely resilient to overfitting, based on these measurements.
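
A minimal sketch of that overfitting check, assuming the model exposes a per-sample probability distribution over the 256 output values. `smoothed_unigram` is a hypothetical stand-in for the real network (it ignores the history entirely), and the data is made up; only the measurement itself is the point.

```python
import math
from collections import Counter


def avg_log_likelihood(model, sequence):
    """Average of log p(x_t | x_<t) over a sequence, in nats per sample.
    `model(history)` must return a list of probabilities, one per value."""
    total = 0.0
    for t, x in enumerate(sequence):
        probs = model(sequence[:t])
        total += math.log(probs[x])
    return total / len(sequence)


def smoothed_unigram(train, num_classes=256):
    """Stand-in model (an assumption): add-one smoothed value frequencies."""
    counts = Counter(train)
    total = len(train) + num_classes
    probs = [(counts[v] + 1) / total for v in range(num_classes)]
    return lambda history: probs


# Made-up quantized audio: the held-out data contains a value never seen in training.
train = [128] * 90 + [129] * 10
held_out = [128] * 5 + [130] * 5

model = smoothed_unigram(train)
train_ll = avg_log_likelihood(model, train)
held_out_ll = avg_log_likelihood(model, held_out)

# A large positive gap means the model fits the training data much better
# than unseen data, i.e. it is overfitting.
gap = train_ll - held_out_ll
print(train_ll, held_out_ll, gap)
```
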
Good point! Are (some of) the chords completely made up, for example, or is it only using chords it has heard before?
Filtering out certain notes from a piano chord can be done by e.g. Melodyne, but that seems far from what's necessary to generate speech, so it would surprise me if WaveNet could do that.
Decades ago, I was testing an LPC-10 vocoder. I discovered many new and strange sounds by playing with the input mic, such as blowing into it or rubbing it. As with the LPC-10, I wonder about the untapped musical possibilities that this allows.
That seems completely tractable by simply adding a bit of the right reverb to the generated sample, more or less "in post".
Good point! Just train it with recordings that have no reverberation, and add it later.
It's quite difficult to have no reverberation, but not too bad at all to keep to a minimum. But reverb plus reverb equals reverb, so it's just a matter of finding one that sounds good.
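
"Reverb plus reverb equals reverb" follows from convolution being associative: applying two room responses in sequence is the same as applying one combined response. A toy sketch, with made-up impulse responses (`residual_room` for whatever reverb is left in the recording, `post_reverb` for the one added afterwards):

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two lists of floats."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


dry = [1.0, 0.0, 0.5]            # hypothetical dry signal
residual_room = [1.0, 0.5, 0.25]  # small reverb baked into the recording
post_reverb = [1.0, 0.0, 0.3]     # reverb added "in post"

# Apply the two reverbs one after the other...
two_steps = convolve(convolve(dry, residual_room), post_reverb)

# ...or fold them into a single combined impulse response first.
combined_room = convolve(residual_room, post_reverb)
one_step = convolve(dry, combined_room)

same = all(abs(a - b) < 1e-12 for a, b in zip(two_steps, one_step))
print(same)
```

So the listener only ever hears one effective room, and the job is to pick a post reverb whose combination with the residual one sounds good.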

It'd also be interesting to know if this technique could solve the "de-reverberation" problem.