You're going to be waiting a while. Even now, these models can't get the details right. Look at all the example pictures: everything is just wrong. You can obviously see what it is trying to get at, but it can't get there. Another example of the last 10% taking 90% of the time.
Jukebox. If you listen to Jukebox samples, recall that they came out quite a while ago in dog/DL years, and imagine what a Jukebox 2, the equivalent of DALL-E 2, would sound like...
I'm surprised no one has tried to launch a music generation startup based on Jukebox. I'd be interested in collaboration if anyone wants to work on it (and has compute resources).