Hacker News
by akrymski 476 days ago
I really wish this trend of prompting gen AI models with text would stop. It's really meaningless. Musicians need gen AI they can prompt with a melody on their keyboard. Or a bit of whistling into the microphone. Or a beat they can tap on the table. That is what allows humans to unleash their creativity. Not AI generating random bits that fit a distribution of training data. The English language is not the right input for anything except information retrieval tasks.
2 comments

Agreed! Those will be much more fun and we plan to support them. However, right now we're focused on making the base model slightly better; then we can easily add all of those controls (à la ControlNet for Stable Diffusion).
But this is not easy; it's the real challenge here, as there are lots of text-to-audio models out there. It is far from solved for Stable Diffusion as well. ControlNet is pretty bad. Just try taking a photo of an empty room and asking an image model to add furniture. Or to change a wall colour. Or to restyle an existing photo in the style of another, and so on. We are very far from being able to truly control the output generated by AI models, which is something that a DAW excels at. I'd start with an AI-powered DAW rather than text-to-audio and try to add controls to it. It's like Cursor vs Lovable, if you get my drift.
> Not AI generating random bits that fit a distribution of training data

How is that specific to text prompting? If you tap your fingers for a model and it generates a song from your tapping, it's still just fitting the training data, as you say.