Hacker News
by fudged71, 928 days ago
If you stitch key frames from the video together and feed them through the GPT-4V vision model, it can check that the steps align with the “story” shown in the images.
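For the stitching part, here is a rough sketch of how one might pick key frames and lay them out as a single contact sheet before sending it to a vision model. Everything in it is an assumption for illustration: the function names `pick_key_frame_indices` and `grid_layout` are hypothetical, evenly spaced sampling stands in for whatever key-frame detection you actually use, and the frame decoding and the model call are left out entirely.

```python
import math


def pick_key_frame_indices(total_frames: int, count: int) -> list[int]:
    """Pick `count` evenly spaced frame indices (hypothetical helper).

    Assumption: evenly spaced sampling is a stand-in for real key-frame
    detection such as scene-change detection.
    """
    if total_frames <= 0 or count <= 0:
        return []
    count = min(count, total_frames)
    # Take the midpoint of each of `count` equal-length segments.
    return [int((i + 0.5) * total_frames / count) for i in range(count)]


def grid_layout(n_tiles: int, columns: int, tile_w: int, tile_h: int):
    """Compute the stitched sheet size and the top-left offset of each tile.

    Tiles are placed left to right, top to bottom, so reading order
    matches the order of the frames in the video.
    """
    if n_tiles <= 0 or columns <= 0:
        return 0, 0, []
    columns = min(columns, n_tiles)
    rows = math.ceil(n_tiles / columns)
    offsets = [
        ((i % columns) * tile_w, (i // columns) * tile_h)
        for i in range(n_tiles)
    ]
    return columns * tile_w, rows * tile_h, offsets


if __name__ == "__main__":
    indices = pick_key_frame_indices(total_frames=100, count=5)
    sheet_w, sheet_h, offsets = grid_layout(len(indices), 3, 320, 180)
    print("frames:", indices)
    print("sheet size:", sheet_w, "x", sheet_h)
    print("offsets:", offsets)
```

The offsets are what you would hand to an image library's paste call (for example Pillow's `Image.paste(tile, (x, y))`) to build the stitched image. Keeping the tiles in reading order is the point, since the model has to follow the sequence as a story.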