by fudged71 928 days ago
If you stitch key frames from the video into a single image and feed it to GPT-4V, the model can check that the steps align with the “story” shown in the images.
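A rough sketch of the stitching half of that idea, in plain Python. Everything here is an assumption for illustration: the names `pick_key_frames` and `stitch_grid` are hypothetical, frames are modelled as nested lists of grayscale values rather than decoded video, and evenly spaced sampling stands in for real key-frame detection. The GPT-4V call itself is left out.

```python
from typing import List

# Hypothetical frame type: rows of grayscale pixel values.
Frame = List[List[int]]


def pick_key_frames(total_frames: int, count: int) -> List[int]:
    """Return up to `count` evenly spaced frame indices (assumed sampling rule)."""
    if total_frames <= 0 or count <= 0:
        return []
    count = min(count, total_frames)
    step = total_frames / count
    # Take the frame at the middle of each of the `count` equal segments.
    return [int(step * i + step / 2) for i in range(count)]


def stitch_grid(frames: List[Frame], columns: int, fill: int = 0) -> Frame:
    """Tile equally sized frames left to right, top to bottom, into one image."""
    if not frames or not frames[0]:
        return []
    if columns <= 0:
        raise ValueError("columns must be positive")
    height, width = len(frames[0]), len(frames[0][0])
    for frame in frames:
        if len(frame) != height or any(len(row) != width for row in frame):
            raise ValueError("all frames must share the same size")
    rows = -(-len(frames) // columns)  # ceiling division
    # Unused cells in the last row keep the `fill` value.
    sheet = [[fill] * (width * columns) for _ in range(height * rows)]
    for i, frame in enumerate(frames):
        top = (i // columns) * height
        left = (i % columns) * width
        for y, row in enumerate(frame):
            sheet[top + y][left:left + width] = row
    return sheet
```

In practice you would decode the frames with a video or image library, stitch them the same way, and send the resulting sheet to the vision model along with the list of steps to compare against.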