Show HN: AI prompt-to-storyboard videos w/ GPT, Coqui voices, StabilityAI images (meyer.id)
29 points by tensorflowz 1154 days ago
I had 2 weeks off from work and wanted a pet project before heading back. With GPT and Generative AI in the news, I decided to chain multiple AI products together to build something really cool. I set my end goal to be: prompt-to-storyboard (aka fun videos generated purely via generative AI).

There are some prompt-to-video products, but I wanted to tell stories with audio as well. The end product takes an initial prompt and produces a series of images and audio files, which I then combine (with subtitles) into the final video. To showcase videos, there is a basic upvote/downvote leaderboard.

Text | OpenAI https://openai.com/

Text is generated in a few high-level steps that I ask GPT to work through. These are all based on the initial user prompt, so they are (ideally) indirectly controlled by the user.

  - Create a concept for a movie scene based on the prompt, including the theme and setting
  - Define each character in the scene
      - Define how each character looks
      - Define how each character sounds
  - Define 'frames' of the storyboard
All of this textual information is defined in a JSON object I describe to GPT. I then take GPT's output and build the storyboard with the tools below.
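To make the shape of that JSON object concrete, here is a minimal sketch in Python. The post does not show the real schema, so every field name below (concept, theme, setting, appearance, voice, emotion, and so on) is a hypothetical stand-in for the steps listed above. The idea is to build one sample object and paste its JSON into the prompt so GPT has something to imitate.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Character:
    name: str
    appearance: str  # how the character looks
    voice: str       # how the character sounds


@dataclass
class Frame:
    description: str  # what the image for this frame should show
    speaker: str      # name of the character speaking
    dialog: str       # the line that is spoken
    emotion: str      # basic emotion for the performance


@dataclass
class Storyboard:
    concept: str
    theme: str
    setting: str
    characters: list = field(default_factory=list)
    frames: list = field(default_factory=list)


def sample_json() -> str:
    """Return a sample storyboard as JSON text, suitable for embedding in a prompt."""
    sample = Storyboard(
        concept="A lighthouse keeper meets a lost sailor",
        theme="melancholy adventure",
        setting="a stormy coast at night",
        characters=[Character("Keeper", "old man in a wool coat",
                              "older man with a raspy voice")],
        frames=[Frame("The keeper lights the lamp", "Keeper",
                      "Not a night to be at sea.", "sad")],
    )
    # asdict() converts nested dataclasses inside lists as well.
    return json.dumps(asdict(sample), indent=2)
```

Calling print(sample_json()) gives text that can be dropped into the prompt as the example object.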

Voices | Coqui https://coqui.ai/

From the GPT output, I needed three major pieces of information to build voices in a way that I found satisfying:

  - Description of the voice
  - Description of the performance
  - Text of the actual dialog spoken
Coqui has a product called 'prompt-to-voice', where you can describe how a character will sound and a custom voice is made for that character - this is how GPT defines the characters to use in the storyboard. As such, every voice is unique per storyboard. GPT will decide that a certain character is an "older man with a raspy voice", and I'll ask Coqui for that type of voice. In addition to this, in order to describe the performance, GPT outputs a basic emotion to summarize the line of dialog (happy, sad, angry, etc) - this is also sent to Coqui per audio clip generated.
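As a rough illustration of how those three pieces could be gathered per line of dialog, here is a small Python sketch. It does not call Coqui and does not reflect Coqui's actual API; the dictionary keys on both the input and output side are assumptions made up for this example.

```python
def build_voice_requests(storyboard: dict) -> list:
    """Pair each line of dialog with its speaker's voice description and emotion.

    The input layout (characters with name/voice, frames with
    speaker/dialog/emotion) is hypothetical, not the author's real schema.
    """
    # Look up each character's voice description by name.
    voices = {c["name"]: c["voice"] for c in storyboard["characters"]}
    requests = []
    for frame in storyboard["frames"]:
        requests.append({
            "voice_description": voices[frame["speaker"]],  # description of the voice
            "emotion": frame["emotion"],                    # description of the performance
            "text": frame["dialog"],                        # text of the dialog spoken
        })
    return requests
```

Each returned entry would correspond to one audio clip to generate.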

Images | Stability AI https://stability.ai/

While I originally set up the storyboard generator to use DALL-E because I was already integrating with OpenAI for GPT, I found the cost prohibitive. As such, the images generated for the storyboards are from Stability AI's Stable Diffusion (stable-diffusion-512-v2-1). To generate each frame, I combine the description of the frame that GPT provides with the theme and setting that GPT output for the whole storyboard. Since GPT controls all of the data sent to Stable Diffusion, if your prompt dictates a theme it should hopefully translate into a theme in your storyboard.

Both the storyboard and the 'prompt enhanced' image generation in the 'Create Content' tab pre-feed a GPT request with a summary of Stability AI's prompt guide. It will try to pick keyword weights to improve the image, and much like the setting and theme, keywords should be influenced by the initial prompt provided to the product.
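A minimal Python sketch of that combination step might look like the following. The function name, the argument order, and the comma-joined format are all assumptions; the post only says the frame description is combined with the storyboard-wide theme and setting.

```python
def build_image_prompt(frame_description: str, theme: str, setting: str) -> str:
    """Join the per-frame description with the storyboard-wide setting and theme.

    Empty or whitespace-only parts are skipped so the prompt has no stray commas.
    """
    parts = [frame_description, setting, theme]
    return ", ".join(p.strip() for p in parts if p and p.strip())
```

Because the theme and setting are appended to every frame, the frames of one storyboard should share a consistent look even though each image is generated separately.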

Conclusion: Have fun and make my 2 weeks of work seem worth it!

Voting on storyboards and creating storyboards both require a simple Google login to get access.

6 comments

Super cool, this is for screenwriters to knock ideas around before making huge commitments. Anything to get an idea away from the page.

Fantastic.

It’s really exciting stuff!

I built something in a similar space this past weekend, for the purpose of education. I’ve wanted to build personalized education for a while, and the tech is finally catching up!

https://twitter.com/_jason_today/status/1647827414901465091?...

Super cool! Being able to chain together multiple generative models seems like it will definitely be able to help people who normally aren't the creative type to start getting creative. Or at least give them easy-to-use tools so more people can make more content :)
I, for one, found this hilarious
> All of this textual information is defined in a JSON object I describe to GPT

Do you get JSON back from ChatGPT as well? Is this consistent? I hadn't really thought about using it as an API platform.

Yes, I get back only JSON. I found that providing a sample object is the best way to be consistent. I also have some basic validation on fields and have follow-up calls if needed. For example, dialog must be at least 1 character and at most 255. If I see empty dialog or dialog that is too long (or any other invalid fields), I pass the JSON back to GPT along with a basic list of 'this property is wrong due to X' and make it provide a new JSON object.
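For anyone curious, the validate-and-retry loop described above can be sketched in Python roughly like this. It is not the actual code: ask_model is a hypothetical callable that sends a message to GPT and returns its text reply, the "frames" and "dialog" field names are assumptions, and the retry count is arbitrary.

```python
import json
from typing import Callable

MAX_DIALOG = 255  # upper bound on dialog length mentioned above


def validate(storyboard: dict) -> list:
    """Return a list of 'this property is wrong due to X' messages (empty if valid)."""
    frames = storyboard.get("frames")
    if not isinstance(frames, list) or not frames:
        return ["frames: must be a non-empty list"]
    errors = []
    for i, frame in enumerate(frames):
        dialog = frame.get("dialog") if isinstance(frame, dict) else None
        if not isinstance(dialog, str) or len(dialog) < 1:
            errors.append(f"frames[{i}].dialog: must be at least 1 character")
        elif len(dialog) > MAX_DIALOG:
            errors.append(f"frames[{i}].dialog: must be at most {MAX_DIALOG} characters")
    return errors


def generate_valid(ask_model: Callable[[str], str], prompt: str, max_retries: int = 3) -> dict:
    """Ask the model, validate the reply, and send problems back until it passes."""
    message = prompt
    for _ in range(max_retries + 1):
        raw = ask_model(message)
        try:
            data = json.loads(raw)
            if isinstance(data, dict):
                errors = validate(data)
            else:
                errors = ["top level: must be a JSON object"]
        except json.JSONDecodeError as exc:
            errors = [f"invalid JSON: {exc.msg}"]
        if not errors:
            return data
        # Follow-up call: hand back the bad JSON plus the list of problems.
        message = (
            "The JSON below has problems. Return a corrected JSON object only.\n"
            + raw + "\nProblems:\n" + "\n".join(f"- {e}" for e in errors)
        )
    raise ValueError("model did not return a valid storyboard")
```

Passing the model call in as a function keeps the loop testable with a fake that returns canned replies.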
GPT4 is much better than GPT3.5 at producing valid JSON and nothing else.

What I’ve found quite useful is to extract text from the first { and last }. This solves the problem of GPT adding some “helpful” explanations outside the JSON.
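A minimal sketch of that trick in Python, assuming the reply contains exactly one top-level JSON object (the function name is made up for this example):

```python
import json


def extract_json_object(text: str) -> dict:
    """Parse the span from the first '{' to the last '}' and ignore the rest."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found")
    return json.loads(text[start:end + 1])
```

This drops any chatter before or after the object, but json.loads will still raise if the object itself is malformed, so it complements validation rather than replacing it.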

I really wanted to use GPT-4 but still just waiting in line :( I should add in your tip, would save me a retry call every now and then I’m sure.
There are also some more lenient JSON parsers that will handle comments and other non-standard things. They can be a bit hard to add in to some frameworks as it’s kind of the opposite of what you are normally trying to do which is reject bad input.
Any plans to share the code? This flow might be useful in many things.
Not sure what my plan will be there - I think the way I define the JSON and validate it could be worthwhile so I may either try and extract it out to some package I can publish or possibly have it as an endpoint for people to consume.

Is this the part you think would be most interesting to others or was there another part of the flow you were referencing?

Yes, that does sound useful, though you are right that it depends on the implementation.
Oh this is cool:)