Hacker News
by haykmartiros 1289 days ago
Other author here! This got posted a little earlier than we intended so we didn't have our GPUs scaled up yet. Please hang on and try throughout the day!

Meanwhile, please read our about page http://riffusion.com/about

It’s all open source and the code lives at https://github.com/hmartiro/riffusion-app --> if you have a GPU you can run it yourself

This has been our hobby project for the past few months. Seeing the incredible results of stable diffusion, we were curious if we could fine tune the model to output spectrograms and then convert to audio clips. The answer to that was a resounding yes, and we became addicted to generating music from text prompts. There are existing works for generating audio or MIDI from text, but none as simple or general as fine tuning the image-based model. Taking it a step further, we made an interactive experience for generating looping audio from text prompts in real time. To do this we built a web app where you type in prompts like a jukebox, and audio clips are generated on the fly. To make the audio loop and transition smoothly, we implemented a pipeline that does img2img conditioning combined with latent space interpolation.
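For readers wondering what "latent space interpolation" can look like in practice, here is a minimal sketch. It assumes spherical linear interpolation (slerp), a common choice for blending diffusion latents; the post does not say which interpolation the app actually uses, and the short vectors below are hypothetical stand-ins for real latent tensors.

```python
import math


def slerp(a, b, t):
    """Spherical linear interpolation between two equal-length vectors.

    t=0 returns a, t=1 returns b; intermediate values follow the arc between them.
    """
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding error
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < 1e-6:
        # Vectors are (anti)parallel: fall back to plain linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    weight_a = math.sin((1 - t) * omega) / sin_omega
    weight_b = math.sin(t * omega) / sin_omega
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]


# Hypothetical 2-D "latents" for two prompts, blended halfway.
latent_a = [1.0, 0.0]
latent_b = [0.0, 1.0]
halfway = slerp(latent_a, latent_b, 0.5)
```

Stepping t from 0 to 1 over successive clips would give a gradual transition from one prompt's latent to the other's, which is the general idea behind a smooth crossfade in latent space.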

27 comments

Wow, I am blown away. Some of these clips are really good! I love the Arabic Gospel one. John and George would have loved this so much. And the fact that you can make things that sound good by going through visual space feels to me like the discovery of a Deep Truth, one that goes beyond even the Fourier transform because it somehow connects the aesthetics of the two domains.
I can simultaneously burst a bubble and provide fuel for more -- the alignment of the intrinsic manifolds of different domains has been an interesting research topic for zero shot research for a few years. I remember seeing at CVPR 2018 the first zero shot...classifier, I think? That if I recall correctly trained in two domains that were automatically basically aligned with each other enough to provide very good zero shot accuracy.

Calling it a Deep Truth might be a bit of an emotional marketing spin but the concept is very exciting nonetheless I believe.

It is a Deep Truth in that the universe is predictable and can be represented (at least the parts we interact with) mathematically. Matrix algebra is a helluva drug. I could imagine someone developing the ability to listen to spectrograms by looking at them.
There is a whole piece in Gödel, Escher, Bach where they look at vinyl records, as all the sound data is in there.
On deep truths: a review of the concept of harmony in the universe: https://www.sciencedirect.com/science/article/pii/S240587262...
I can't listen to them, but I can certainly point out different instruments, background noise sources and the like, and get an idea of the tone of a piece. This is easy. The hard part is distilling texture, timbre etc. of each sound.
Well it's no surprise that it kinda sorta works. Neural networks are very good at learning the underlying structure of things and working with suboptimally represented inputs. But if working with images of spectrograms works better than just samples in time domain, that is a valid and non-obvious finding.
My characterization of it as a Deep Truth might just be a reflection of my ignorance of the current state of the art in AI. But it's still pretty frickin' cool nonetheless.
Alright so this is a pretty amazing new development. I want to tell you something about what the state of the art is in AI. When you wrote that it is a deep truth it was before I actually listened to the pieces. I had just read the descriptions. At the time, I thought that you were probably right because I was thinking that music is only pleasing because of the structure of our brains; it's not like vision, where originally we are interpreting the world and that's where art comes from. Music is purely sort of abstract or artistic. However, after I listened to the pieces, I realised that they really sound exactly like the instruments that are making the physical noises. For example it really sounds exactly like a physical piano. So I don't know about a deep truth, but it does seem that there is a physical sense that the music represents which it can successfully mimic using this essentially image generating capability. One thing about all of these amazing AI developments is that I still make some long comments by dictating to Google. When it first got to the point that it was able to catch almost everything that I was saying I was absolutely blown away. However, it's really not that good at taking dictation, and I have to go back and replace each and every individual "comma" and "period" with the corresponding punctuation mark. Seeing such amazing developments happening month after month, year after year, makes me feel like we are really approaching what some people have called the singularity. When I read about net positive fusion being announced my first instinct was to think: oh, of course, it's now that ChatGPT is available; of course announcing a major fusion breakthrough would happen within days to weeks; it just makes perfect sense that AIs can solve problems that have confounded scientists for decades. To see just how far we still have to go, take a look at how this comment read before I manually corrected it to what I had actually said.

-- [I copied and pasted the below to the above and then corrected it. Below is the original version. This is how I dictate to Google sometimes, on Android. Normally I would have further edited the above but in this case I wanted to show how far basic things like dictation still have to go. By the way I dictated in a completely quiet room. I can't wait for more advanced AI like ChatGPT to take my dictation.]

Alright so this is a pretty amazing our new development period I want to tell you something about out why the state of the heart is is in a i period when you wrote that it is a deep truth it was before I actually listen to The Pieces, I have just read the descriptions period at the time, I thought that you were probably right because I was thinking that music is only pleasing because of the structure of our brains it's not like vision where originally we are interpreting the world and that Where Art comes from music is purely so dove abstract or artistic period however, after I listen to the pieces, I realise that they really sound exactly like the instruments that are making the physical noises period for example it really sounds exactly like a physical piano period so I don't know about out a deep truth karma but it does seem that there is a physical sense that the music are represents which it can successfully mimic using this essentially image generating capability period one thing about all of these amazing AI development, is that I still make some long comments by dictating to Google. 
When it first got to the point that it was able to catch almost everything then was saying I was absolutely blown away period however, it's really not that good at taking dictation karma and I have to go back and replace each and every individual, and period with with the corresponding punctuation mark period seeing such an amazing developments happening month after month year after year ear makes me feel like we are really approaching what some people have called the singularity period when I read about out net positive fusion being announced my first Instinct was to think oh of course it's now that that chat GPT is available of course announcing a major fusion breakthrough would happen within in days to weeks it just makes perfect sense DJ eyes can solve problems that have have confounded scientists for decades period to see just how far we still have to go take a look at how this comment red before I manually corrected it to what I had actually set

As one of the meatsacks whose job you're about to kill... eh, I got nothin, it's damn impressive. It's gonna hit electronic music like a nuclear bomb, I'd wager.
As a listener, I think you're probably still safe. Can you use this to help you though? Maybe.

It's impressive what it produces, but I think it probably lacks substance in the same way the visual AI art stuff does. For the most part, it passes what I call the at-a-glanceness test. It's little better than apophenia (the same thing that makes you see shapes in clouds, faces in rocks, or think you've recognised a familiar word in a foreign language; the last one can happen more often though).

So, I think these tools will be used to do background work (i.e. for visuals, maybe help with background tasks in CGI or faraway textures in games). I know less about audio, but I assume it could maybe help a DJ create a transition between two segments they want to combine, as opposed to making the whole composition for them, but idk if that example makes sense.

Now, onto a more human point: I think that people often listen to music because it means something to them. Similar for people who appreciate visual art.

I also love interactive and light art, and I love talking to other artists at light festivals who make them because of the stories and journeys behind their art too. Humans and art are a package deal, IMO.

Edit: typos and to add: Also, I think prompt authorship is an art unto itself. I'm amazed what people can craft with it, but I'm more impressed by the craft itself than the outputs. Don't get me wrong, the outputs are darn cool, but not if you look closer. And it's impossible to look beneath the surface altogether, as there is nothing in the output but the pixels.

I think this type of generative stuff opens up entirely new possibilities. For the longest time I've wanted to host a rowing or treadmill competition, where contestants submit a music track. The tracks are mashed up with weighting based on who is in the lead and by how much.

I don't know of existing tech that can generate actual good mashups in realtime given arbitrary mp3s, but this has promise!

It's not too hard these days with open source BPM detection and stem separation libraries: https://github.com/deezer/spleeter
No, because it is a function ("AI") that generates an image of a spectrogram given text.

neither a set of MP3 nor a set of spectrograms from MP3s supplies the function arguments

or a connection to a path that uses that function

It says all StableDiffusion capabilities work, so you can prompt it with an image (either "img2img" or "textual inversion"). Their UI just doesn't expose it.
In general all this stuff is chopping the bottom off the market. AI art, code, writing, music, etc. can all generate passable "filler" content, which will decimate all human employment generating same.

I don't think this stuff is a threat to genuinely innovative, thoughtful, meaningful work, but that's the top of the market.

That being said the bottom of the market is how a lot of artists make their living, so this is going to deeply impact all forms of art as a profession. It might soon impact programming too because while GPT-type systems can't do advanced high level reasoning they will chop the bottom off the market and create a glut of employees that will drive wages down across the board.

Basic income or revolution. That's going to be our choice.

> Basic income or revolution. That's going to be our choice.

Evolution.

We have such vast wealth and our historic methods for trying to make sure most people are taken care of are failing us. Those methods were rooted in the nuclear family with a head of household earning most of the money and jobs designed with an assumption that he had a full-time homemaker wife buying the groceries, cooking the meals etc so he could focus on his job.

We need jobs to evolve. In the US at least, we need to move away from tying all benefits (such as medical benefits and retirement) to a primary earner. We need to make it possible to live a comfortable life without a vehicle. We need to make it possible for small households to find small homes that make sense for them, both financially and in terms of lifestyle.

There is a lot we can do to make this not a disaster and make it possible for some people to survive on very little while pursuing their bliss so that we stop this trend of pitting The Haves against The Have Nots and make the current Have Nots a group that has real hope of creating their own brilliant tech or such someday while not being utterly miserable if they aren't currently wealthy.

Those making the decisions can very well just say "WE and a 10-20% still needed just need to live comfortably, and the remaining 80% can live in slums on the edge of town".
That sounds like the "revolution" option.
> Basic income or revolution. That's going to be our choice.

So many menial jobs are kind of like basic income anyway - you put in 2 hours of actual work to pad out the entire day at some shitty low end job, knowing all the time that your contribution isn't valued and that if your employer ever got their shit together your job wouldn't even be needed, and the robots are coming for it anyway. You get paid a small amount for doing nothing much useful.

The rich today are rich largely because they or their ancestors were plunderers. Perhaps they plundered the planet, exploiting the cheap energy that fossil fuels provide. Perhaps they plundered our social cohesion building skinner boxes that manipulate the minds of millions just to gain eyeballs and clicks.

Why should the bill for these past excesses fall on those who never benefited from them? In previous times, a young person of average intellect could get a job on a farm or factory and be a valued contributor. What happens when automation removes the last of these jobs - do we really expect people to put up with more and more menial and slavish existences?

Basic income is, like carbon taxes, an obvious solution. Maybe it will take off when a tipping point arrives - when the rich class decides that their repugnance to giving someone a "free ride" is overtaken by their need to have masses dulled and stupified, sitting at home with blinds drawn in front of their playstations, so they don't revolt at the obvious unfairness of the world.

Paying people to stay home and play on social media led to the mass explosion of conspiracy theories at the beginning of covid. Ultimately, unemployment and covid checks go a long way toward explaining the January 6th insurrection. Which is just to say that giving a free ride and narcotic forms of entertainment to the masses isn't necessarily the safety valve for "the rich" that it's made out to be.

Work gives people dignity. And idle hands are the devil's plaything. Put that together and UBI would be a disaster. Also, it's not "the rich class" who decides whether or not to bestow such a lifestyle on the masses... that in itself is a conspiratorial line of thought. Right down that road is the thought "hey, this UBI isn't enough!"

Jobs disappear. Other jobs replace them. Often, jobs are not fun, and often they feel meaningless, but working is still much more dignified than not working. Raising generations who've never worked and simply take their UBI and breed - what would even be the point of educating such people? Eventually they'd just be totally disposable and, no doubt, be disposed of.

Plunderers; well put. Capitalists can't lie their way to infinite growth forecasts and suck all the wealth into 401ks that do nothing but rob everyone else's grandchildren. It's a cycle that has been going on since existence itself, an ebb and flow that accelerates, crashes, and takes off again, leaving in its wake humanity as we know it.
Chopping the bottom off makes things higher up the ladder more accessible though. The original Zelda took six people multiple years to build, but one person could develop something similar but much better looking in a few weeks with Unity and AI generated assets. It obviously won't be a AAA title, but people have shown that they're happy to play slightly rough, retro games if they're fun. All this holds true for writing, music, art and other areas as well.

The big problem is that it's hard to filter through the huge amount of content being produced by humans to find things that you'll like, so we rely on kingmakers curating the culture. This means a few huge winners taking all and a lot of great creative work at the same level going unrewarded. If we can solve the content discovery problem in a more personalized and fair way and make it easier for people to support creators they like that would go a long way towards cushioning the job losses that AI will create.

> Basic income or revolution. That's going to be our choice.

Third option. Mandatory 4 day weeks.

Although I'd specify it as no one can work more than X hours a week.

And then adjust X down - or up in short timescales but likely down overall - as needed.

The competition is for "work". If AI is taking large chunks of "work" off the table, spread the rest of it around.

Now notionally people will tell you that there is no finite "work" limit. You are effectively limiting competition.

To which I say - good. The rat race IS the competition. Don't we all want to slow it down a little? If F1 can put limits on a race, we should too for humanity.

Work smarter, not harder.

The only thing that affects whether you have a job is the Federal Reserve, not how good productivity tools are. You always have comparative advantage vs an AI, so you always have the qualifications for an entry level job.

There will never be a revolution and there's no such thing as late capitalism. Well, not if the Fed does their job.

I see a lot of AI naysayers neglecting the comparative advantage part.

If AI completely eliminates low skill art labour from the job pool, it's not like those affected by it are gonna disintegrate, riot, and restructure society. They have the choice of filling an art niche an AI can't or they can spend that time learning other, more in-demand skills. This also ignores the fact that some companies would rather reallocate you to more profitable projects even if your art skills don't change.

Selling a product with relative value like a painting or a sculpture will always be an uphill battle. Now that there's more competition from AI, it just gives artists/businesses incentive to find what people want that an AI can't deliver. Worst case scenario, employment rates in this sector are rough while the market recalibrates. Interested to see how these technologies develop.

That seems a bit like wishful thinking.

People don't have unlimited ability to learn new skills. Training takes time, and someone who spent several years honing their craft won't be able to pick up a new skill overnight.

On top of that, people have preferences regarding their work – even if someone has the ability to do different work, they might find it less meaningful and less satisfying.

Finally, don't ignore the speed at which AI capabilities improve. Compare GPT-1 with the current model, and how quickly we got here. Eventually we'll get to a point where humans just won't be able to catch up quickly enough.

I think specifically in the area of creative "products" such as art and music you have to think about the customer as well. I have zero interest in AI-created art or music. None. The value of art is its humanity; its expression of the artist's message, vision, and passion. AI doesn't have that, so it's not of any interest to me.

I don't know how many customers feel the same way, but I won't be purchasing any AI art or music or knowingly giving it any of my attention.

The top of the market started at the bottom. Entry level is requiring higher and higher skills and capabilities.
> basic income or revolution

I’ve been trying to play through the scenario in my head. At least in terms of software developers being replaced by AI, I think we’re going to first see AI doing work in parallel with, or monitored by, humans. Basically, Google will take AI and send it off to do work that they lack the staff to do. Now, on the other hand, they could also temporally play it out where first they feign an inability to staff people due to finances so there are layoffs/terminations, and then maybe a quarter later they replace those people with low cost AI compute time that is orders of magnitude more productive.

In any case, AI disrupting people’s ability to feed, shelter, and clothe themselves is sure to trigger a pretty brutal and hostile response, which would be grounds for legislation and perhaps a class war.

The weird part is that if the potential of AI is truly orders of magnitude expansion beyond what we already have, then the longterm surely has room for a tiny little mankind fief. But, in order to get to the long term our hyper-competitive technocratic overlords may strangle out part of or all of the rest of us while justifying accelerating through the near-term window to achieve AI-dominance.

If the only people who can have meaningful good paying jobs are thoughtful geniuses we're in a lot of trouble as a society still.
> Basic income or revolution. That's going to be our choice.

I fear you are right. But neither of those is going to be an easy transition, if only because the effects of all this innovation are felt disproportionately by people in countries where such a revolution will not do anything to give solace.

Basic income assumes that the funds to do this are available and revolution assumes that the powers that be are the parties that are in the way of a more equitable division of the spoils. Neither of those are necessarily true for all locations affected.

>Basic income or revolution. That's going to be our choice

Basic income sounds good in theory in some imaginary futuristic society of harmony and grace.

In real life, it's a way for the masses to be controlled down to their very subsistence by the state. Where the state is basically an intermediary for big private interests and lobbies.

It gets better very quickly and we have no idea where its limitations are. In other words, we have no idea when the development will slow down significantly and how much of the bottom will it have chopped down by then. Whether it's 10% or maybe a 100.

> Basic income or revolution. That's going to be our choice.

I'm definitely pro basic income, but I heard an interesting remark a few weeks ago. And that's that COVID was kind of a UBI experiment (in the US), albeit very limited, and it turned out that if people don't have to worry about making a living and don't have a job to work in then they'll start doing stupid things on the internet. Like making up stupid conspiracy theories about vaccines. I can't remember who said this, it was one of the guests on Lex Fridman's podcast. I'm also not sure if it's a valid analogy but it reminds me of Vonnegut's Player Piano.

As a musician and listener i'm inclined to agree. There were a couple cool examples i bumped into, but some prompts generate results that don't represent any single word or combination of words that were presented to the AI.

What this means for the future is maybe a little more unsettling however.

I fully agree with what you wrote. This AI-generated music, while a great achievement, still sounds soulless. It's one thing to look at AI-generated pictures for a few seconds, but listening to this music with its gibberish "lyrics" for minutes really creeps me out - it's the "uncanny valley" all over again, I guess.

Regarding "can you use this to help you though?" - yeah, you could probably use it as a source of inspiration, but at the risk of getting sued by someone whose music you didn't even know you were copying...

Yea, it's uncanny valley, sure. For now.

With Stable Diffusion and similar generative systems we have seen a leap in generative art/media, partially with significant improvements within a few months. What makes you think this was the last or only leap in the next 5 to 10 years? As if progress would just stop here? Huh?!

Do you think we hit a ceiling where progress is only tangential? A line which is impossible to cross? Otherwise I don't get this mindset in the face of these modern generative AI systems popping up left and right.

I think this is a good point. To make this useful for music creators, and to make music creation more generally accessible, the output needs to be more useful. We are working on that at https://neptunely.com
Potentially it could be used as temporary atmospherical music for pre-viz video shots
These are tools. Don't think of them as replacements, they aren't. But as tools that will help us be creative. As smart as these apps seem, they will still need a human to decide where and how to use them. They won't replace us but we need to adapt to a new reality.
I hear this a lot (in relation to various jobs) and I still don't get it. Yes, it is a tool. Yes, if it can, it will replace humans. That's the whole point.

For some reason people tend to think that these tools/AI/ML systems will never be good enough to do their job (or a specific job). This argument can take different forms, sometimes stating that it will just do the boring part of the work (e.g. with programming) or that it will still need human creativity (maybe, but not necessarily and that's not the point) or that it will just replace low level, unskilled or mediocre professionals. And somehow everyone thinks they are not mediocre (i.e. average). But even these assumptions are unfounded. Why would anyone think that these systems will top out below their skill levels? Why would anyone think that they can't become superhuman?

They did in chess, go, I think poker too. Not to mention protein folding. And without much of a hitch between mediocre/good enough and superhuman. Because that difference is just interesting for us, but doesn't necessarily mean that there is a huge step, that the system needs to undergo serious development and that it would take a long time. (Like decades or so.) People thought that was the case when AlphaGo beat Fan Hui, saying that Lee Sedol was a completely different level. Which, of course, he is. Still, it just took DeepMind half a year to improve AlphaGo to that level.

So yeah, you can be pretty sure that if this track (no pun intended), if this solution is good enough then it will quickly evolve into something that will replace some music creators.

>As smart as these apps seem, they will still need a human to decide where and how to use them. They won't replace us

Well, they will, if the AI plus 1 human deciding "where and how to use them" can replace producers and musicians playing...

It already is a replacement. You can make a visual novel video game with AI generated character art, backgrounds, music, run your dialogs through AI if you can't write well yourself - and your game will have higher production level than 90% of competition. All those artists you would normally hire or commission the above stuff from are now out of the process if you want. Sure, it's not a particularly high bar, but it's only going to rise from here.
Kurt Vonnegut and Player Piano has a message for you.
These will be full replacements in no time, give or take 10 years.
10 years‽ StableDiffusion was released August 22nd of this year!
I love simple generative approaches to get ideas, and go from there. This seems like an extension of that (well, it's what I'm going to try - sample the output, make stems, pull MIDI etc). Will make the creative process more interesting for me, not less.

Having said that, it's not my job, and I can see where the issues lie there.

I can't think of a genre that would embrace it faster. The pay-for-knock-off rap beat market will feel more pressure from this kind of tool, especially as loop-oriented as it already is.
Why do you think this will kill your job? To me this looks like an extension of the hip-hop genre.
I am an active musician, but I don't actually make money at it, I was mostly joking, but: I believe that we are (one determined smart person + six months) away from bots on Youtube and other streaming platforms that generate endless "new" music that follows those simple formulas (beat, bass, sample = loop, several loops connected up = song) 24/7.

Raves that have no human DJs and never stop.

Good? Bad? Not for me to say, really, the most I make at it is a couple hundred bucks for two nights in a bar playing FM radio hits, and there's lots of people younger than me who like that music, so obviously I'm doing it for different reasons and I don't anticipate losing access to as many bar gigs as I want for the rest of my life.

But certain genres are very tolerant of low-effort music, and I think the people who are monetizing low-effort music are gonna lose their income streams. I do different things than those people, but I still consider them compatriots, even if I don't care for their art.

Isn't this just a sampler with extra steps?
All the AI music I’ve heard so far has a really unpleasant resonant quality to it. Why is that? Can it be removed?
I've done some work on AI audio synthesis and the artifacts you're hearing in these clips are coming from the algorithm that is used to go from the synthesized spectrogram to the audio (the Griffin-Lim algorithm).

Audio spectrograms have two components: the magnitude and the phase. Most of the information and structure is in the magnitude spectrogram so neural nets generally only synthesize that. If you were to look at a phase spectrogram it looks completely random and neural nets have a very, very difficult time learning how to generate good phases.

When you go from a spectrogram to audio you need both the magnitudes and phases, but if the neural net only generates the magnitudes you have a problem. This is where the Griffin-Lim algorithm comes in. It tries to find a set of phases that works with the magnitudes so that you can generate the audio. It generally works pretty well, but tends to produce that sort of resonant artifact that you're noticing, especially when the magnitude spectrogram is synthesized (and therefore doesn't necessarily have a consistent set of phases).
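To make the phase-recovery loop described above concrete, here is a minimal sketch of Griffin-Lim in plain Python. It is an illustration under stated assumptions, not the code behind the clips: the frame length, hop size, iteration count and test signal are arbitrary choices, and the naive DFT is far too slow for real audio (in practice one would reach for a library routine such as librosa's `griffinlim`). The loop starts from random phases, then alternates between resynthesizing audio and re-imposing the known magnitudes.

```python
import cmath
import math
import random

N, HOP = 32, 8  # frame length and hop size, illustrative values only
WINDOW = [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]  # Hann
# TWIDDLE[k][n] = exp(-2*pi*i*k*n/N), the DFT kernel (symmetric in k and n)
TWIDDLE = [[cmath.exp(-2j * math.pi * k * n / N) for n in range(N)] for k in range(N)]


def stft(x):
    """Windowed DFT of every frame; returns a list of N-bin complex spectra."""
    frames = []
    for start in range(0, len(x) - N + 1, HOP):
        seg = [x[start + n] * WINDOW[n] for n in range(N)]
        frames.append([sum(s * t for s, t in zip(seg, TWIDDLE[k])) for k in range(N)])
    return frames


def istft(frames, length):
    """Least-squares inverse: windowed overlap-add over summed squared window."""
    out = [0.0] * length
    norm = [0.0] * length
    for m, spec in enumerate(frames):
        for n in range(N):
            # Inverse DFT for sample n, keeping the real part.
            value = sum(c * t.conjugate() for c, t in zip(spec, TWIDDLE[n])).real / N
            out[m * HOP + n] += WINDOW[n] * value
            norm[m * HOP + n] += WINDOW[n] ** 2
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]


def griffin_lim(mag, length, iterations=20, seed=0):
    """Estimate a signal whose STFT magnitudes approximate `mag`."""
    rng = random.Random(seed)
    # Start from the given magnitudes with random phases.
    spec = [[m * cmath.exp(2j * math.pi * rng.random()) for m in frame] for frame in mag]
    errors = []
    for _ in range(iterations):
        est = stft(istft(spec, length))  # project onto spectrograms of real signals
        errors.append(sum((abs(e) - m) ** 2
                          for fe, fm in zip(est, mag) for e, m in zip(fe, fm)))
        # Keep the estimated phases, restore the known magnitudes.
        spec = [[m * (e / abs(e) if abs(e) > 1e-12 else 1.0) for e, m in zip(fe, fm)]
                for fe, fm in zip(est, mag)]
    return istft(spec, length), errors


# Toy target: two sinusoids; only their magnitudes are handed to the algorithm.
target = [math.sin(2 * math.pi * 3 * n / N) + 0.5 * math.sin(2 * math.pi * 7.3 * n / N)
          for n in range(256)]
magnitudes = [[abs(c) for c in frame] for frame in stft(target)]
audio, errors = griffin_lim(magnitudes, len(target))
```

The `errors` list tracks how far the resynthesized spectrogram's magnitudes are from the requested ones. It shrinks over the iterations but generally does not reach zero, and that residual inconsistency is one way to think about the artifacts mentioned above.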

There are other ways of using neural nets to synthesize the audio directly (Wavenet being the earliest big success), but they tend to be much more expensive than Griffin-Lim. Raw audio data is hard for neural nets to work with because the context size is so large.

Phase is critical for pitch. Here is why. The spectral transformation breaks up the signal into frequency bins. The frequency bins are not accurate enough to convey pitch properly. When a periodic signal is put through an FFT, it will land in a particular frequency bin. Say that the frequency of the signal is right in the middle of that bin. If you vary its pitch a little bit, it will still land in the same bin. Knowing the amplitude of the bin doesn't give you the exact pitch. The phase information from a single frame will not give it to you either. However, between successive FFT frames, the phase will rotate. The more off-center the frequency is, the more the phase rotates. If the signal is dead center, then each successive FFT frame will show the same phase. When it is off center, the waveform shifts relative to the window, and so the phase changes for every frame. From the rotating phase, you can determine the pitch of that signal with great accuracy.
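A small sketch of the idea above, in plain Python: two tones that land in the same bin are told apart by how much the bin's phase advances between two frames. The sample rate, frame size, hop and test frequencies are arbitrary assumptions chosen so the numbers are easy to follow, and the helper names are made up for this illustration.

```python
import cmath
import math

SAMPLE_RATE, SIZE, HOP = 8000, 256, 64  # bin spacing is 8000 / 256 = 31.25 Hz


def frame_bin(signal, start, k):
    """Hann-windowed DFT coefficient of bin k for the frame starting at `start`."""
    total = 0j
    for n in range(SIZE):
        w = 0.5 - 0.5 * math.cos(2 * math.pi * n / SIZE)
        total += w * signal[start + n] * cmath.exp(-2j * math.pi * k * n / SIZE)
    return total


def peak_bin(signal):
    """Bin with the largest magnitude in the first frame (what amplitude alone tells you)."""
    return max(range(1, SIZE // 2), key=lambda k: abs(frame_bin(signal, 0, k)))


def refine_frequency(signal, k):
    """Estimate frequency from the phase advance of bin k between two frames."""
    first = frame_bin(signal, 0, k)
    second = frame_bin(signal, HOP, k)
    expected = 2 * math.pi * k * HOP / SIZE  # advance if the tone sat at bin centre
    measured = cmath.phase(second) - cmath.phase(first)
    # Deviation from the expected advance, wrapped into [-pi, pi).
    deviation = (measured - expected + math.pi) % (2 * math.pi) - math.pi
    return (expected + deviation) * SAMPLE_RATE / (2 * math.pi * HOP)


def tone(freq):
    return [math.sin(2 * math.pi * freq * n / SAMPLE_RATE) for n in range(512)]


centered, detuned = tone(1000.0), tone(1010.0)
same_bin = peak_bin(centered) == peak_bin(detuned)  # both peak in bin 32
estimate_centered = refine_frequency(centered, 32)
estimate_detuned = refine_frequency(detuned, 32)
```

With these settings the phase method is only unambiguous within about 62.5 Hz of the bin centre (half the sample rate divided by the hop), which is why real phase vocoders choose the hop with care.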
Yes, this is exactly right and is why Griffin-Lim generated audio often has a sort of warbly quality. If you use a large FFT you can mitigate the issues with pitch because the frequency resolution in your spectrogram is higher, so the phase isn't so critical to getting the right pitch. But the trade-off of a bigger FFT is that the pitches now have to be stationary for longer.

The other place where phase is critical is in impulse sounds like drum beats. A short impulse is essentially just energy over a broad range of frequencies, but the phases have been chosen such that all the frequencies cancel each other out everywhere except for one short duration where they all add constructively. Without the right phases, these kinds of sounds get smeared out in time and sound sort of flat and muffled. The typing example on their demo page is actually a good example of this.
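The point about impulses can be checked numerically. The sketch below builds two signals from the same set of equal-amplitude harmonics, one with all phases aligned and one with random phases. The harmonic count, length and seed are arbitrary assumptions for illustration.

```python
import math
import random

K, LENGTH = 32, 256  # number of harmonics and signal length, illustrative values


def synthesize(phases):
    """Sum of unit-amplitude cosines at harmonics 1..K with the given phases."""
    return [sum(math.cos(2 * math.pi * (k + 1) * n / LENGTH + p)
                for k, p in enumerate(phases))
            for n in range(LENGTH)]


def energy(x):
    return sum(v * v for v in x)


def crest_factor(x):
    """Peak level divided by RMS level; high for clicks, low for smeared noise."""
    return max(abs(v) for v in x) / math.sqrt(energy(x) / len(x))


rng = random.Random(0)
aligned = synthesize([0.0] * K)  # all components peak together at n = 0
scrambled = synthesize([rng.uniform(0, 2 * math.pi) for _ in range(K)])
```

Both signals have identical magnitude spectra and identical total energy, but `aligned` concentrates that energy into a sharp click while `scrambled` spreads it across the whole frame. This is the smearing effect that wrong phases have on drum hits.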

So what is phase? From dabbling with waveforms in audio editors, sampling, and later learning a little bit about complex numbers, phase seems eventually equivalent to what would sound like changing pitch, modulating the frequency of a periodic signal.

The simplest demonstration of it is the doppler shift. But it's not at all that simple because moving relative to the source the sound pressure and thus the perceived loudness also change, distorting the wave form, thereby introducing resonant frequencies. Now imagine that the transducer is always moving, eg. a plucked string.

The ideal harmonic pendulum swings periodically, only losing attenuation. But the resonant transducer picks up reflections of its own signal, like coupled pendulums, which are intractable according to the three body problem.

On top of that, our hearing is fine tuned to voices and qualities of noise.

Phase is the offset in time. The functions sin(θ) and sin(θ + c), for arbitrary real c, represent the same frequency signal; they are offset from each other horizontally by c, and that c is a phase difference. It has an interpretation as an angle, when the full cycle of the wave is regarded as degrees around a circle; and that's what I mean by rotating phase.

When you take a window of samples of a signal, and run the FFT on it, for every frequency bin, the calculation determines what is the amplitude and phase of the signal. If you have a frequency bin whose center is 200 Hz, and there is a 200 Hz signal, then what you get for that frequency bin is a complex number. The complex number's magnitude ("modulus") is the amplitude of that signal, and its angle ("argument") is the phase.

If the signal is exactly 200 Hz, and if the successive FFT windows move by a multiple of 1/200th of a second, then the phase will be the same in successive FFT windows.

But suppose that the signal is actually 201 Hz: a little faster. Then with each successive FFT window, the phase will not line up any more with the previous window; it will advance a little bit. We will see a rotating complex value: same modulus, but the angle advancing.

From how fast the angle advances relative to the time step between FFT windows, we can deduce that we are capturing a 201 Hz signal in that bin (on the hypothesis that we have a pure, periodic signal in there).
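
The 200 Hz versus 201 Hz argument can be tried out in a few lines of standard-library Python. This is only a sketch under assumptions of my own: an 8000 Hz sample rate, a 400-sample rectangular window (so bin 10 is centred on 200 Hz), and a hop of 400 samples, which is a whole multiple of 1/200 s. The function names are made up for the illustration.

```python
import cmath
import math

SAMPLE_RATE = 8000   # Hz, assumed for this sketch
WINDOW = 400         # samples per frame, so bins are 20 Hz apart
HOP = 400            # samples between frames = 10 periods of 200 Hz
BIN = 10             # bin centred on 10 * 8000 / 400 = 200 Hz

def bin_value(signal, start, k=BIN, n=WINDOW):
    """Correlate one frame with the k-th complex sinusoid (a single DFT bin)."""
    return sum(signal[start + i] * cmath.exp(-2j * math.pi * k * i / n)
               for i in range(n))

def estimate_frequency(signal):
    """Estimate frequency from the phase advance between two successive frames."""
    first = bin_value(signal, 0)
    second = bin_value(signal, HOP)
    advance = cmath.phase(second / first)          # measured phase rotation
    centre = BIN * SAMPLE_RATE / WINDOW            # 200 Hz
    expected = 2 * math.pi * centre * HOP / SAMPLE_RATE
    # Wrap the difference into [-pi, pi) to get the rotation beyond bin centre.
    deviation = (advance - expected + math.pi) % (2 * math.pi) - math.pi
    return centre + deviation * SAMPLE_RATE / (2 * math.pi * HOP)

def tone(freq, length=WINDOW + HOP):
    """A pure cosine at the given frequency, long enough for two frames."""
    return [math.cos(2 * math.pi * freq * i / SAMPLE_RATE) for i in range(length)]

print(estimate_frequency(tone(200.0)))  # phase does not rotate
print(estimate_frequency(tone(201.0)))  # phase rotates a little each frame
```

Both tones land in the same 200 Hz bin, yet the estimate for the second one comes out within a small fraction of a hertz of 201. The residual error comes from leakage with the rectangular window; a tapered window would reduce it.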

How is the phase determined in the frequency bin? It's basically a vector correlation: a dot product. The samples are a vector which is dot-producted with a complex unit vector. The complex unit vector in the 200 Hz bin is essentially a 200 Hz sine and cosine wave, rolled into a single vector with the help of complex numbers. Sine and cosine are 90 degrees apart in phase, so they form a rectilinear basis (coordinate system). The calculation projects the signal, expressing it as a sum of the sine and cosine vectors. How much of one versus the other is the phase. A signal that is 100% correlated with the sine will have a phase angle of 0 degrees or possibly 180. If it correlates with the cosine component, it will be 90 or 270. Or some mixture thereof.

Because a complex number is two real numbers rolled into one, it simplifies the calculation: instead of doing a dot product with a sine and cosine vector to separately correlate the signal to the two coordinate bases, the complex numbers do it in one dot product operation. When we go around the unit circle, each position on the circle is cos(θ) + i·sin(θ). These complex values give us samples of both functions. Exactly such values are stuffed into the rows of the DFT matrix: complex values from the unit circle divided into equal divisions.

If you look here at the definition of the ω (omega) parameter:

https://en.wikipedia.org/wiki/DFT_matrix

It is the N-th complex root of unity. But what that really means is that it is a 1/Nth step of the way around the unit circle. For instance if N happened to be 360, then ω is the complex number whose modulus is |ω| = 1 (unit vector), and whose argument is 1 degree: one degree around the circle. The second row of the DFT matrix has 1, ω, ω², ω³, ... the second row represents the lowest frequency (after zero, which is the first row). It captures a single cycle of a sine and cosine waveform, in N samples. The values in that row step around the unit circle in the smallest increment, so they go around the circle exactly once. The subsequent rows go around the circle in skipped steps, yielding higher frequencies: 1, ω², ω⁴ for twice around the circle; 1, ω³, ω⁶ for three times, ... we get all the harmonics up to our N resolution.
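
To make the matrix concrete, here is a short standard-library Python sketch that builds it from powers of ω and applies it to one cycle of a cosine. It assumes the sign convention of the linked Wikipedia page, ω = e^(-2πi/N), so the steps go clockwise; `dft_matrix` and `dft` are names chosen for this example.

```python
import cmath
import math

def dft_matrix(n):
    """Build the n x n DFT matrix from powers of omega, an n-th root of unity."""
    omega = cmath.exp(-2j * math.pi / n)   # one 1/n step around the unit circle
    # Row r holds 1, omega^r, omega^(2r), ...: r trips around the circle.
    return [[omega ** (row * col) for col in range(n)] for row in range(n)]

def dft(samples):
    """Multiply the DFT matrix by the sample vector: one dot product per bin."""
    matrix = dft_matrix(len(samples))
    return [sum(w * x for w, x in zip(row, samples)) for row in matrix]

N = 8
OFFSET = math.pi / 2   # assumed phase offset of the test signal
samples = [math.cos(2 * math.pi * i / N + OFFSET) for i in range(N)]
spectrum = dft(samples)

print("modulus of bin 1: ", abs(spectrum[1]))
print("argument of bin 1:", cmath.phase(spectrum[1]))
```

The signal completes exactly one cycle in N samples, so it correlates with the second row. The modulus of that bin is N/2 times the amplitude, and its argument recovers the offset measured relative to a cosine.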

> on the hypothesis that we have a pure, periodic signal in there

That pure sine wouldn't generate any artefacts. It would result in a 200Hz output from the AI if it throws the phase information out. You wouldn't hear a difference unless it's an (aptly so called) complex signal. Eg. 200 and 201 Hz layered is an impure signal with a period below 1Hz, far outside the scope. Eventually the signals will cancel out completely. [1]

The important point is, I think, that FFT doesn't simply look at the offset aka phase. Rather, 201 Hz looks like a 200 Hz that is moving. So it encodes phase-shift in the delta of the offset between two windows. For a sum of 200 and 201 Hz it has to assume that the magnitude is also changing, which I find entirely counterintuitive.
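
The changing magnitude for a sum of 200 and 201 Hz follows from the identity sin(a) + sin(b) = 2·sin((a+b)/2)·cos((a−b)/2). A quick numerical check in Python, with function names invented for this sketch:

```python
import math

def two_tones(t, f1=200.0, f2=201.0):
    """Direct sum of two sines at f1 and f2."""
    return math.sin(2 * math.pi * f1 * t) + math.sin(2 * math.pi * f2 * t)

def carrier_times_envelope(t, f1=200.0, f2=201.0):
    """The same signal written as a 200.5 Hz carrier with a slow envelope."""
    carrier = math.sin(2 * math.pi * (f1 + f2) / 2 * t)
    envelope = 2 * math.cos(2 * math.pi * (f2 - f1) / 2 * t)
    return carrier * envelope

# Compare the two forms over one second, sampled at an assumed 8000 Hz.
worst = max(abs(two_tones(n / 8000) - carrier_times_envelope(n / 8000))
            for n in range(8000))
print("largest difference:", worst)
```

The two forms agree to floating-point precision. The envelope 2·cos(πt) passes through zero at t = 0.5 s, which is the moment the two tones cancel, so a bin that cannot separate them sees one tone whose magnitude swells and fades.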

From the mathematical perspective, this seems like a boring homework, far detached from acoustics. So, I don't know. The funny thing is that rotation is very real in the movement of strings. If the orbit in one point is elliptic, that's like two sinusoids at different magnitudes offset by some 90 degrees, in a simplified model. But it has nearly infinite coupled points along its axis. As they excite each other, and each point has a different distance to the receiver, that's where phase shift happens.

> If you look here at the definition of the ω (omega) parameter

I wasn't going to make drone, but I will take a look.

1: https://graphtoy.com/?f1(x,t)=100*sin(x)&v1=true&f2(x,t)=100...

I wonder if this could be improved by using the Hartley transform instead of the Fourier transform.
Considering Stable Diffusion generates 3-channel (RGB) images, maybe it would be possible to train it on amplitude and phase data as two different channels?
People have tried that, but the model essentially learns to discard the phase channel because it is too hard for it to learn any useful information from it.
Got any citations... that sounds like a fascinating thing to read about.
We took a look at encoding phase, but it is very chaotic and looks like Gaussian noise. The lack of spatial patterns is very hard for the model to generate. I think there are tons of promising avenues to improve quality though.
Phase itself looks random, but what makes the sound blurry is that the phase doesn't line up like it should across frequencies at transients. Maybe something the model could grab hold of better is phase discontinuity (deviation from the expected phase based on the previous slices) or relative phase between peaks, encoded as colour?

But the same thing could be done as a post-processing step, finding points where the spectrum is changing fast and resetting the phases to make a sharper transient.

That makes a lot of sense, I would be keen to see attempts at that.
I'm curious why, instead of using magnitude and phase, you wouldn't use real and imaginary parts?
There have been some attempts at doing this, some of which have been moderately successful. But fundamentally you still have the problem that from the NN's perspective, it's relatively easy for it to learn the magnitude but very hard for it to learn the phase. So it'll guess rough sizes for the real and imaginary parts, but it'll have a hard time learning the correct ratio between the two.

Models which operate directly on the time domain have generally had a lot more success than models that operate on spectrograms. But because time-domain models essentially have to learn their own filterbank, they end up being larger and more expensive to train.

I wonder if there might be room for a hybrid approach, with a time-domain model taking machine-generated spectrograms as input and turning them into sound. (Just a thought, no idea whether it actually makes sense.)
would it be an approach to use separate color channels for the freq amplitude and freq phase in the same picture? Maybe the network then has a better way of learning the relationships and there would be no need for the postprocessing to generate a phase.
RAVE attacks the phase issue by using a second step of training. I don't completely understand it, but it uses a GAN architecture to make the outputs of a VAE sound better.
Griffin-Lim is slow and is almost certainly not being used.

A neural vocoder such as HiFi-GAN [1] can convert spectra to audio - not just for voices. Spectral inversion works well for any audio domain signal. It's faster and produces much higher quality results.

[1] https://github.com/jik876/hifi-gan

If you check their about page they do say they're using Griffin-Lim.

It's definitely a useful approach as an early stage in a project since Griffin-Lim is so easy to implement. But I agree that these days there are other techniques that are as fast or faster and produce higher quality audio. They're just a lot more complicated to run than Griffin-Lim.

Author here: Indeed we are using Griffin-Lim. Would be exciting to swap it out with something faster and better though. In the real-time app we are running the conversion from spectrogram to audio on the GPU as well because it is a nontrivial part of the time it takes to generate a new audio clip. Any speed up there is helpful.
I think this is because the generation is done in the frequency domain. Phase retrieval is based on heuristics and not perfect, so it leads to this "compressed audio" feel. I think it should be improvable
The link is down now, so I don't know about this one. But most generated music is generated in the note domain, rather than the audio domain. Any unpleasant resonance would be introduced in the audio synthesis step. And audio synthesis from note data is a very solved problem for any kind of timbre you can conceive of, and some you can't.
You're probably talking about the artifacts of converting a low resolution spectrogram to audio.
Can the spectrogram image be AI upscaled before transforming back to the time domain?
Yes it exists: https://ccrma.stanford.edu/~juhan/super_spec.html

But the issue is not that the spectrogram is low quality.

The issue is that the spectrogram only contains the amplitude information. You also need phase information for generating audio from the spectrogram.

Interesting, can't you quantize and snap to a phase that makes sense to create the most musical resonance?
What happens if you run one of the spectrogram pictures through an upscaler for images like ESRGAN ?
It sounds kind of like the visual artifacts that are generated by resampling in two dimensions. Since the whole model is based on compressing image content, whatever it's doing DSP-wise is more-or-less "baked in", and a probable fix would lie in doing it in a less hacky way.
The first ever recordings had people shouting to get anything to register. They sounded like tin. Fast forward to today.

Looking back at image generation just a year or two ago and people would have said similar things.

Not hard to imagine the trajectory of synthesized audio taking a similar path.

Presumably for similar reasons that the vast majority of AI generated art and text is off-puttingly hideous or bland. For every stunning example that gets passed around the internet, thousands of others sucked. Generating art that is aesthetically pleasing to humans seems like the Mt. Everest of AI challenges to me.
I think your comment is off-topic to the post you are replying to. That wasn't asking about the general aesthetic quality - more about a specific audio artifact.

> For every stunning example that gets passed around the internet, thousands of others sucked.

From personal experience this is simply untrue. I don't want to debate it because you seem to have strong feelings about the topic.

Even if you remove the artifact, the exact same comment applies. It generates a somewhat less interesting version of elevator music. This is not to crap on what they did. As I said, the underlying problem is extremely difficult and nobody has managed to solve it.

I don't feel strongly about this topic at all.

> It generates a somewhat less interesting version of elevator music.

This iteration does, but that's an artifact of how it's being generated: small spectrograms that mutate without emotional direction (by which I mean we expect things like chord changes and intervals in melodies that we associate with emotional expressions - elevator music also stays in the neutral zone by design).

I expect with some further work, someone could add a layer on top of this that could translate emotional expressions into harmonic and melodic direction for the spectrogram generator. But maybe that would also require more training to get the spectrogram generator to reliably produce results that followed those directions?

The vast majority of human generated art is hideous or bland. Artists throw away bad ideas or sketches that didn’t work all the time. Plus you should see most of the stuff that gets pasted up on the walls at an average middle school.
Hard disagree. The average middle school picture will have certain aspects exaggerated giving you insights into the mind's eye of the creator, how they see the world, what details they focus on. There is no such mind's eye behind AI art so it's incredibly boring and mundane, no matter how good a filter you apply on top of its fundamental lack of soul or anything interesting to observe in the picture beyond surface level. It's great for making art for assets for businesses to use, it's almost a perfect match, as they are looking to have no controversial soul to the assets they use, but lots of pretty bubblegum polish.
Perhaps most of the AI art out there (that honestly represents itself as such) is boring and mundane, but after many hours exploring latent space, I assure you that diffusion models can be wielded with creativity and vision.

Prompting is an art and a science in its own right, not to speak of all the ways these tools can be strung together.

In any case, everything is a remix.

I have to agree, the act of coming up with a prompt is one and the same with providing "insights into the mind's eye of the creator, how they see the world, what details they focus on" - two people will describe the same scene with completely different prompts.
And the vast majority of professionally produced artwork is for business use. It’s packaging design or illustration or corporate graphics or logos or whatever.

I don’t get the objection.

> For every stunning example that gets passed around the internet, thousands of others sucked

…implying there may be an art to AI art. Hmm.

Meanwhile, the degree to which it is off-puttingly hideous in general can be seen in the popularity of Midjourney — which is to observe millions of folks (of perhaps dubious aesthetic taste) find the results quite pleasing.

Not sure about this. Models like Midjourney seem to put out very consistently good images.
I've compiled/run a dozen different image to sound programs and none of them produce an acceptable sound. This bit of your code alone would be a great application by itself.

It'd be really cool if you could implement an MS paint style spectrum painting or image upload into the web app for more "manual" sound generation.

Amazing work! Did you use CLIP or something like that to train genre + mel-spectrogram? What datasets did you use?
I was very surprised this was not mentioned.
/u/threevox on reddit made a colab for playing with the checkpoint:

https://colab.research.google.com/drive/1FhH3HlN8Ps_Pr9OR6Qc...

Hi Hayk, I see that the inference code and the final model are open source. I am not expecting it, but is the training code and the dataset you used for fine-tuning, and process to generate the dataset open source?
"fine-tuned on images of spectrograms paired with text"

How many paired training images / text and what was the source of your training data? Just curious to know how much fine tuning was needed to get the results and what the breadth / scope of the images were in terms of original sources to train on to get sufficient musical diversity.

The audio sounds a bit lossy, would it be possible to create high quality spectrograms from music, downsample them, and use that as training data for a spectrogram upscaler?

It might be the last step this AI needs to bring some extra clarity to the output.

This is amazing! This is a fantastic concept generator. The verisimilitude with specific composers and techniques is more than a little uncanny. A few thoughts after exploring today…

- My strongest suggestion is finding some strategy for smoothing over the sometimes harsh-sounding edge of the sample window
- Perhaps it could be filling in/passing over segments of what is sounded to user as a larger loop? Both giving it a larger window to articulate things but maybe also showcasing the interpolation more clearly…
- Tone control may seem challenging but I do wonder if you couldn’t “tune” the output of the model as a whole somehow (given the spectrogram format it could be a translation/scale knob potentially?)

When you say fine tuned do you mean fine tuned on an existing stable diffusion checkpoint? If so which?

It would be very interesting to see what the stable diffusion community that is using automatic1111 version would do with this if it were made into an extension.

Yes from https://huggingface.co/runwayml/stable-diffusion-v1-5. Our checkpoint works with automatic1111, and if you'd like to make an extension to decode to audio, it should be pretty straightforward: https://github.com/hmartiro/riffusion-inference/blob/main/ri...
Can you run this on any hardware already capable of running SD 1.5? I am downloading the model right now, might play with this later.

Guessing at the speed with which AI is developing these days someone is going to have the extension up in two hours at most.

I bet the AUTOMATIC1111 web UI music plugin drops within 48 hours.
I have made a basic version here:

https://github.com/enlyth/sd-webui-riffusion

Yes! Although to have real time playback with our defaults you need to be able to generate 40 steps at 512x512 in under 5 seconds.
Good to know. I was just so close with just under 7s using 40 steps and Euler a as sampler.
Super clever idea of course. But leaving aside how it was produced, I’ll be one of those who is underwhelmed by the musicality of this. I am judging this in terms of classical music. I repeatedly tried to get it to just play pure piano music without any other add-ons (cymbals etc). It kept mixing the piano with other stuff.

Also the key question is - would something like this ever produce something as hauntingly beautiful and unique as classical music pieces?

Hayk! How smart are you! I loved your work on SymForce and Skydio - totally wasn't expecting you to be co-author on this!

On a serious note, I'd really love some advice from you on time management and how you get so much done? I love Skydio and the problems you are solving, especially on the autonomy front, are HARD. You are the VP of Autonomy there and yet also managed to get this done! You are clearly doing something right. Teach us, senpai!

Hello - this is awesome work. Like other commenters, I think the idea that if you are able to transfer a concept into a visual domain (in this case via fft) it becomes viable to model with diffusion is super exciting but maybe an oversimplification. With that in mind, do you think this type of approach might work with panels of time series data?
Did you have a data set for training the relationship between words and the resulting sound?
Super! Makes sense since Skydio is also amazing.

How much data is used for fine tuning? Since spectrograms are (surely?) very out of distribution for the pre training dataset, how much value does the pre training really bring?

To be honest, we're not sure how much value image pre training brings. We have not tried to train from scratch, but it would be interesting.

One thing that's very important though is the language pre-training. The model is able to do some amazing stuff with terms that do not appear in our data set at all. It does this by associating with related words that do appear in the dataset.

Hi, I really admire the skill you put at work on this project. At the same time, I think everyone is overlooking how crucial and problematic the training factor is.

Why was stable diffusion able to generate spectrograms? Because it was fed some. Presumably, those original spectrograms were scraped with little concern over creators' permissions, just like it has been for artists' work in order to produce art-looking image generation. Please, research what has been happening in the art community lately. https://www.youtube.com/watch?v=Nn_w3MnCyDY

A protest on ArtStation has been shown to influence Midjourney's results, proving that huge amounts of proprietary work are constantly scraped without the creators' permission. AIs like these work so well just because they steal and remix real artists' work in the first place. There are going to be legal wars about this.

Stable Diffusion doesn't have an official music generation AI precisely because it couldn't train it with the same approach without being sued by music labels right away, while isolated artists don't have the same power.

So, back to my question: have you wondered whose work is Stable Diffusion remixing here? Your endeavour is great technically, but as we progress into the future we have to be more aware of the ethical implications that come with different forms of progress.

You could try to base your project on a collection of free-to-use spectrograms, and see how it performs. If you do, I think it could actually be very interesting and useful to discuss the results here on Hacker News.

Cheers!

What I would really like to know - what happens if one trains that model from scratch (or is that not possible and training requirements are different? Sry for my ignorance, I never fine-tuned any diffusion model before)?

In my experience (CNN based imagery segmentation) proven architectures (e.g. U-Net) performed similarly with or without fine-tuning existing models (that have been mostly trained on ImageNet, Cityscapes, etc.) IF the domain was rather different.

At least in the field of imagery segmentation there is not much of a point in fine-tuning an off-the-shelf model on let's say medical imagery.

So maybe it's the same for the stable diffusion model. I don't see how some knowledge about the relationship between the prompt and given imagery describing that prompt should help this model map the prompt to a spectrogram of the given prompt.

You can embed images in spectrograms.. might sound weird though
This is groundbreaking! All other attempts at AI generated music have IMO, fallen flat... These results are actually listenable, and enjoyable! This is almost frightening how powerful this can be
Obviously this needs a little more polish, but I've wanted this for so long I'm willing to pay for it now if it helps push the tech forward. Can I give you money?
What sort of setup do you need to be able to fine tune Stable Diffusion models? Are there good tutorials out there for fine tuning with cloud or non-cloud GPUs?
Reach out to the Beatstars CEO. He was looking for an AI play for his music producers marketplace. Probably solid B2B lead there.
Amazing work. Can this be applied to voice?

Example prompt: “deep radio host voice saying ‘hello there’”

Kind of like a more expressive TTS?

Author here: It can certainly be applied to voice, but the model would need deeper training to speak intelligibly. If you want to hear more singing, you can try a prompt like "female voice", and increase the denoising parameter in the settings of the app.

That said, our GPUs are still getting slammed today so you might face a delay in getting responses. Working on it!

Amazing work! Do you plan on open-sourcing the code to train the model?
The site isn't working for me? Anything I have to fix on my side to make it work?
Crashes repeatedly on iOS in Firefox (my usual browser), is OK on Safari though, so probably not a webkit thing.
This is super awesome.

Have you already explored doing the same with voice cloning?

How many songs did you use for the training data?
is classical music harder? noticed you didn't have any classical music tracks. i wonder if it is because it is more structured?
funny that Hayk is an early skydio guy!

2 amazing AI projects. Huge respect :)