by antognini 1289 days ago
I've done some work on AI audio synthesis and the artifacts you're hearing in these clips are coming from the algorithm that is used to go from the synthesized spectrogram to the audio (the Griffin-Lim algorithm).

Audio spectrograms have two components: the magnitude and the phase. Most of the information and structure is in the magnitude spectrogram, so neural nets generally only synthesize that. If you were to look at a phase spectrogram, it would look completely random, and neural nets have a very, very difficult time learning how to generate good phases.

When you go from a spectrogram to audio you need both the magnitudes and phases, but if the neural net only generates the magnitudes you have a problem. This is where the Griffin-Lim algorithm comes in. It tries to find a set of phases that works with the magnitudes so that you can generate the audio. It generally works pretty well, but tends to produce that sort of resonant artifact that you're noticing, especially when the magnitude spectrogram is synthesized (and therefore doesn't necessarily have a consistent set of phases).
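For anyone who wants to see the idea in code, here is a minimal NumPy-only sketch of the Griffin-Lim loop. It is not the project's implementation: the FFT size, hop, window, iteration count, and the two-tone test signal are all assumptions chosen for illustration, and the helper names (`stft`, `istft`, `griffin_lim`, `spectral_error`) are made up for this example.

```python
import numpy as np

N_FFT = 512
HOP = N_FFT // 4  # 75% overlap, so squared Hann windows sum to 1.5 away from the edges
WINDOW = np.hanning(N_FFT + 1)[:-1]  # periodic Hann window

def stft(x):
    """Hann-windowed STFT, shape (frames, N_FFT // 2 + 1)."""
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP : i * HOP + N_FFT] * WINDOW for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec):
    """Overlap-add inverse of stft(); the first and last few hops come out tapered."""
    frames = np.fft.irfft(spec, n=N_FFT, axis=1) * WINDOW
    out = np.zeros(N_FFT + HOP * (len(frames) - 1))
    for i, frame in enumerate(frames):
        out[i * HOP : i * HOP + N_FFT] += frame
    return out / 1.5

def griffin_lim(magnitude, n_iter=50, seed=0):
    """Alternate between audio and STFT, keeping the target magnitudes
    fixed and only ever updating the phases."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))  # random starting phases
    for _ in range(n_iter):
        audio = istft(magnitude * phase)
        phase = np.exp(1j * np.angle(stft(audio)))  # keep the phase, drop the magnitude
    return istft(magnitude * phase)

def spectral_error(audio, magnitude):
    """Relative mismatch between the audio's STFT magnitude and the target."""
    return np.linalg.norm(np.abs(stft(audio)) - magnitude) / np.linalg.norm(magnitude)

# Toy target: the magnitude spectrogram of two steady tones (1 s at 8 kHz).
t = np.arange(8000) / 8000
target = np.abs(stft(np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 660 * t)))

print("random phases:", spectral_error(griffin_lim(target, n_iter=0), target))
print("50 iterations:", spectral_error(griffin_lim(target), target))
```

The printed error measures how far the resynthesized audio's magnitudes are from the ones you asked for. A synthesized spectrogram may not correspond to any real signal at all, in which case the loop can only get close, which is where the leftover artifacts come from.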

There are other ways of using neural nets to synthesize the audio directly (WaveNet being the earliest big success), but they tend to be much more expensive than Griffin-Lim. Raw audio data is hard for neural nets to work with because the context size is so large.


Phase is critical for pitch. Here is why. The spectral transformation breaks up the signal into frequency bins. The frequency bins are not accurate enough to convey pitch properly. When a periodic signal is put through an FFT, it will land in a particular frequency bin. Say that the frequency of the signal is right in the middle of that bin. If you vary its pitch a little bit, it will still land in the same bin. Knowing the amplitude of the bin doesn't give you the exact pitch. The phase from a single frame will not give it to you either. However, between successive FFT frames, the phase will rotate. The more off-center the frequency is, the more the phase rotates. If the signal is dead center, then each successive FFT frame will show the same phase. When it is off center, the waveform shifts relative to the window, and so the phase changes for every frame. From the rotating phase, you can determine the pitch of that signal with great accuracy.
Yes, this is exactly right and is why Griffin-Lim generated audio often has a sort of warbly quality. If you use a large FFT you can mitigate the issues with pitch because the frequency resolution in your spectrogram is higher, so the phase isn't so critical to getting the right pitch. But the trade-off of a bigger FFT is that the pitches now have to be stationary for longer.
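To put rough numbers on that trade-off: the bin spacing is the sample rate divided by the FFT size, and the frame duration is the FFT size divided by the sample rate. A small sketch, assuming a 44.1 kHz sample rate (the thread doesn't say what rate is used) and a hypothetical helper `fft_tradeoff`:

```python
def fft_tradeoff(sample_rate, n_fft):
    """Return (bin spacing in Hz, frame duration in ms) for one FFT size."""
    return sample_rate / n_fft, 1000.0 * n_fft / sample_rate

for n_fft in (512, 2048, 8192):
    spacing_hz, frame_ms = fft_tradeoff(44100, n_fft)
    print(f"n_fft={n_fft}: {spacing_hz:.1f} Hz per bin, {frame_ms:.1f} ms per frame")
```

Each 4x increase in FFT size makes the bins 4x narrower but also makes every frame 4x longer.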

The other place where phase is critical is in impulse sounds like drum beats. A short impulse is essentially just energy over a broad range of frequencies, but the phases have been chosen such that all the frequencies cancel each other out everywhere except for one short duration where they all add constructively. Without the right phases, these kinds of sounds get smeared out in time and sound sort of flat and muffled. The typing example on their demo page is actually a good example of this.
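A quick way to convince yourself of this, as a toy NumPy sketch (the length and the random seed are arbitrary choices): take a single-sample click, keep its magnitude spectrum exactly, and scramble only the phases.

```python
import numpy as np

n = 1024
impulse = np.zeros(n)
impulse[n // 2] = 1.0                      # a single click in the middle

magnitude = np.abs(np.fft.rfft(impulse))   # flat: every bin has magnitude 1

rng = np.random.default_rng(0)
random_phase = np.exp(2j * np.pi * rng.random(magnitude.shape))
random_phase[0] = 1.0                      # DC and Nyquist bins must stay real
random_phase[-1] = 1.0
smeared = np.fft.irfft(magnitude * random_phase, n=n)

def peak_to_rms(x):
    """Crest factor: how much the largest sample stands out from the average level."""
    return np.max(np.abs(x)) / np.sqrt(np.mean(x ** 2))

print("energy:", np.sum(impulse ** 2), np.sum(smeared ** 2))   # same energy
print("crest factor, original phases:", peak_to_rms(impulse))  # 32.0
print("crest factor, random phases:  ", peak_to_rms(smeared))  # much smaller
```

Both signals have identical magnitudes and identical energy. The only difference is the phases, and with random phases the click turns into a low-level burst of noise spread over the whole window.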

So what is phase? From dabbling with waveforms in audio editors, sampling, and later learning a little bit about complex numbers, phase seems eventually equivalent to what would sound like changing pitch, modulating the frequency of a periodic signal.

The simplest demonstration of it is the Doppler shift. But it's not at all that simple, because when you move relative to the source, the sound pressure and thus the perceived loudness also change, distorting the waveform and thereby introducing resonant frequencies. Now imagine that the transducer is always moving, e.g. a plucked string.

The ideal harmonic pendulum swings periodically, only losing amplitude. But the resonant transducer picks up reflections of its own signal, like coupled pendulums, which are intractable according to the three-body problem.

On top of that, our hearing is fine tuned to voices and qualities of noise.

Phase is the offset in time. The functions sin(θ) and sin(θ + c), for arbitrary real c, represent the same frequency signal; they are offset from each other horizontally by c, and that c is a phase difference. It has an interpretation as an angle, when the full cycle of the wave is regarded as degrees around a circle; and that's what I mean by rotating phase.

When you take a window of samples of a signal and run the FFT on it, the calculation determines, for every frequency bin, the amplitude and phase of the signal. If you have a frequency bin whose center is 200 Hz, and there is a 200 Hz signal, then what you get for that frequency bin is a complex number. The complex number's magnitude ("modulus") is the amplitude of that signal, and its angle ("argument") is the phase.

If the signal is exactly 200 Hz, and if the successive FFT windows move by a multiple of 1/200th of a second, then the phase will be the same in successive FFT windows.

But suppose that the signal is actually 201 Hz: a little faster. Then with each successive FFT window, the phase will not line up any more with the previous window; it will advance a little bit. We will see a rotating complex value: same modulus, but the angle advancing.

From how fast the angle advances relative to the time step between FFT windows, we can deduce that we are capturing a 201 Hz signal in that bin (on the hypothesis that we have a pure, periodic signal in there).
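Here is that deduction as a small NumPy sketch. The sample rate, FFT size, hop, and the helper names `bin_phase` and `estimate_frequency` are assumptions picked so that bin 10 is centred on 200 Hz and the hop is exactly 1/200 s; it also assumes, as above, that there is a single pure tone in the bin.

```python
import numpy as np

sr = 8000      # sample rate in Hz (assumed)
n_fft = 400    # bin spacing is sr / n_fft = 20 Hz, so bin 10 is centred on 200 Hz
hop = 40       # 40 samples = 1/200 s, one full cycle of the bin centre
k = 10         # the bin we look at

def bin_phase(x, start):
    """Phase of bin k for the Hann-windowed frame beginning at sample `start`."""
    frame = x[start:start + n_fft] * np.hanning(n_fft)
    return np.angle(np.fft.rfft(frame)[k])

def estimate_frequency(x):
    """Refine the frequency in bin k from the phase advance between two frames."""
    advance = bin_phase(x, hop) - bin_phase(x, 0)
    expected = 2 * np.pi * k * hop / n_fft                    # advance of a tone at the bin centre
    deviation = np.angle(np.exp(1j * (advance - expected)))   # wrap into (-pi, pi]
    return k * sr / n_fft + deviation * sr / (2 * np.pi * hop)

t = np.arange(sr) / sr
print(estimate_frequency(np.sin(2 * np.pi * 200 * t)))  # close to 200
print(estimate_frequency(np.sin(2 * np.pi * 201 * t)))  # close to 201
```

The magnitude in bin 10 is nearly identical for both tones; only the extra 1.8 degrees of rotation per hop tells them apart. This is the same trick a phase vocoder uses.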

How is the phase determined in the frequency bin? It's basically a vector correlation: a dot product. The samples are a vector which is dot-producted with a complex unit vector. The complex unit vector in the 200 Hz bin is essentially a 200 Hz sine and cosine wave, rolled into a single vector with the help of complex numbers. Sine and cosine are 90 degrees apart in phase, so they form a rectilinear basis (coordinate system). The calculation projects the signal, expressing it as a sum of the sine and cosine vectors. How much of one versus the other is the phase. A signal that is 100% correlated with the sine will have a phase angle of 0 degrees or possibly 180. If it correlates with the cosine component, it will be 90 or 270. Or some mixture thereof.
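The dot-product view is easy to check numerically. A sketch with assumed numbers (8 kHz sample rate, 400 samples, a 200 Hz test tone with amplitude 0.7 and a 60 degree phase offset):

```python
import numpy as np

sr = 8000
n = 400                  # bin spacing is sr / n = 20 Hz, so bin 10 sits at 200 Hz
k = 10
idx = np.arange(n)

signal = 0.7 * np.cos(2 * np.pi * 200 * idx / sr + np.pi / 3)

# Two real dot products: correlate with a cosine and with a sine at the bin frequency.
cos_part = signal @ np.cos(2 * np.pi * k * idx / n)
sin_part = signal @ np.sin(2 * np.pi * k * idx / n)

# One complex dot product does both at once.
basis = np.exp(-2j * np.pi * k * idx / n)
coeff = signal @ basis

print(np.isclose(coeff, cos_part - 1j * sin_part))   # True
print(np.isclose(coeff, np.fft.fft(signal)[k]))      # True: this is the FFT bin
print(2 * np.abs(coeff) / n)                         # amplitude, about 0.7
print(np.degrees(np.angle(coeff)))                   # phase, about 60 degrees
```

One caveat on conventions: NumPy's FFT measures phase relative to a cosine, so a pure sine shows up at -90 degrees rather than 0. That is the same picture as described above, just with the reference axis rotated.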

Because a complex number is two real numbers rolled into one, it simplifies the calculation: instead of doing a dot product with a sine and cosine vector to separately correlate the signal to the two coordinate bases, the complex numbers do it in one dot product operation. When we go around the unit circle, each position on the circle is cos(θ) + i sin(θ). These complex values give us samples of both functions. Exactly such values are stuffed into the rows of the DFT matrix: complex values from the unit circle divided into equal divisions.

If you look here at the definition of the ω (omega) parameter:

https://en.wikipedia.org/wiki/DFT_matrix

It is the N-th complex root of unity. But what that really means is that it is a 1/Nth step of the way around the unit circle. For instance if N happened to be 360, then ω is the complex number whose modulus is |ω| = 1 (unit vector), and whose argument is 1 degree: one degree around the circle. The second row of the DFT matrix has 1, ω, ω², ω³, ... the second row represents the lowest frequency (after zero, which is the first row). It captures a single cycle of a sine and cosine waveform, in N samples. The values in that row step around the unit circle in the smallest increment, so they go around the circle exactly once. The subsequent rows go around the circle in skipped steps, yielding higher frequencies: 1, ω², ω⁴ for twice around the circle; 1, ω³, ω⁶ for three times, ... we get all the harmonics up to our N resolution.
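You can build that matrix by hand and compare it with a library FFT. A small sketch with N = 8 (an arbitrary choice); note that it uses NumPy's conventions, where ω = exp(-2πi/N) steps clockwise and there is no 1/√N normalization factor:

```python
import numpy as np

N = 8
omega = np.exp(-2j * np.pi / N)        # one 1/N-th step around the unit circle

rows, cols = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
dft_matrix = omega ** (rows * cols)    # row k holds 1, w^k, w^(2k), ...

x = np.random.default_rng(0).standard_normal(N)
print(np.allclose(dft_matrix @ x, np.fft.fft(x)))   # True
print(np.allclose(np.abs(dft_matrix), 1.0))         # True: every entry is on the unit circle
print(np.isclose(omega ** N, 1.0))                  # True: N steps bring you back to 1
```

Multiplying by this matrix is just N of those complex dot products at once, one per row. The FFT computes the same result with far fewer operations.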

> on the hypothesis that we have a pure, periodic signal in there

That pure sine wouldn't generate any artefacts. It would result in a 200 Hz output from the AI if it throws the phase information out. You wouldn't hear a difference unless it's an (aptly so-called) complex signal. E.g. 200 and 201 Hz layered is an impure signal with a beat frequency of 1 Hz, far outside the scope. Eventually the signals will cancel out completely. [1]

The important point is, I think, that FFT doesn't simply look at the offset aka phase. Rather, 201 Hz looks like a 200 Hz that is moving. So it encodes phase-shift in the delta of the offset between two windows. For a sum of 200 and 201 Hz it has to assume that the magnitude is also changing, which I find entirely counterintuitive.
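The changing magnitude is easy to see numerically. A sketch under assumed settings (8 kHz sample rate, 50 ms Hann-windowed frames, one frame every 0.1 s), watching only the bin centred on 200 Hz:

```python
import numpy as np

sr = 8000
n_fft = 400            # 50 ms frames; bin 10 is centred on 200 Hz
k = 10
t = np.arange(2 * sr) / sr
x = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 201 * t)

window = np.hanning(n_fft)
starts = np.arange(0, len(x) - n_fft, sr // 10)     # one frame every 0.1 s
mags = np.array([np.abs(np.fft.rfft(x[s:s + n_fft] * window)[k]) for s in starts])

for s, m in zip(starts, mags):
    print(f"t = {s / sr:.1f} s  magnitude = {m:6.1f}")
```

Both tones fall in the same 20 Hz wide bin, so the frame is too short to separate them. What the bin reports instead is a single component whose magnitude swells and nearly vanishes once per second, which is the beat you would hear.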

From the mathematical perspective, this seems like boring homework, far detached from acoustics. So, I don't know. The funny thing is that rotation is very real in the movement of strings. If the orbit at one point is elliptic, that's like two sinusoids at different magnitudes offset by some 90 degrees, in a simplified model. But it has nearly infinite coupled points along its axis. As they excite each other, and each point has a different distance to the receiver, that's where phase shift happens.

> If you look here at the definition of the ω (omega) parameter

I wasn't going to make drone, but I will take a look.

1: https://graphtoy.com/?f1(x,t)=100*sin(x)&v1=true&f2(x,t)=100...

I wonder if this could be improved by using the Hartley transform instead of the Fourier transform.
Considering Stable Diffusion generates 3-channel (RGB) images, maybe it would be possible to train it on amplitude and phase data as two different channels?
People have tried that, but the model essentially learns to discard the phase channel because it is too hard for it to learn any useful information from it.
Got any citations... that sounds like a fascinating thing to read about.
We took a look at encoding phase, but it is very chaotic and looks like Gaussian noise. The lack of spatial patterns is very hard for the model to generate. I think there are tons of promising avenues to improve quality though.
Phase itself looks random, but what makes the sound blurry is that the phase doesn't line up like it should across frequencies at transients. Maybe something the model could grab hold of better is phase discontinuity (deviation from the expected phase based on the previous slices) or relative phase between peaks, encoded as colour?

But the same thing could be done as a post-processing step, finding points where the spectrum is changing fast and resetting the phases to make a sharper transient.

That makes a lot of sense, I would be keen to see attempts at that.
I'm curious why, instead of using magnitude and phase, you wouldn't use real and imaginary parts?
There have been some attempts at doing this, some of which have been moderately successful. But fundamentally you still have the problem that from the NN's perspective, it's relatively easy for it to learn the magnitude but very hard for it to learn the phase. So it'll guess rough sizes for the real and imaginary parts, but it'll have a hard time learning the correct ratio between the two.

Models which operate directly on the time domain have generally had a lot more success than models that operate on spectrograms. But because time-domain models essentially have to learn their own filterbank, they end up being larger and more expensive to train.

I wonder if there might be room for a hybrid approach, with a time-domain model taking machine-generated spectrograms as input and turning them into sound. (Just a thought, no idea whether it actually makes sense.)
Would it be an approach to use separate color channels for the frequency amplitude and frequency phase in the same picture? Maybe the network would then have a better way of learning the relationships, and there would be no need for postprocessing to generate a phase.
RAVE attacks the phase issue by using a second step of training. I don't completely understand it, but it uses a GAN architecture to make the outputs of a VAE sound better.
Griffin-Lim is slow and is almost certainly not being used.

A neural vocoder such as HiFi-GAN [1] can convert spectra to audio, and not just for voices. Spectral inversion works well for any audio-domain signal. It's faster and produces much higher quality results.

[1] https://github.com/jik876/hifi-gan

If you check their about page they do say they're using Griffin-Lim.

It's definitely a useful approach as an early stage in a project since Griffin-Lim is so easy to implement. But I agree that these days there are other techniques that are as fast or faster and produce higher quality audio. They're just a lot more complicated to run than Griffin-Lim.

Author here: Indeed we are using Griffin-Lim. Would be exciting to swap it out with something faster and better though. In the real-time app we are running the conversion from spectrogram to audio on the GPU as well because it is a nontrivial part of the time it takes to generate a new audio clip. Any speed up there is helpful.