|
|
|
|
|
by bartwr
657 days ago
|
|
It's the other way around - in hearing, phase is almost irrelevant. At medium frequencies, moving head by a few centimeters changes phase wand phase relationships of all frequencies - and we don't perceive it at all! Most audio synthesis methods work on variants of spectrograms and phase is approximated only later (mattering mostly for transients and rapid frequency content changes). In images, scrambling phase yields a completely different image. A single edge will have the same spectral content as pink/brown~ish noise, but they look completely unlike one another. |
|
So when generating audio I think the next chunk needs to be continuous in phase to the last chunk, where in images a small discontinuity in phase would just result in a noisy patch in the image. That's why I think it should be somewhat like video models, where sudden, small phase changes from one frame to the next give that "AI graininess" that is so common in the current models