| HN Mirror

	by fudged71 928 days ago
	If you stitch key frames from the video together and feed them to the GPT-4V vision model, the model can check that the steps align with the “story” shown in the images.
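A rough sketch of the "stitch key frames together" step, for anyone who wants to try it. This is only one way to do it and rests on assumptions not in the comment: evenly spaced sampling stands in for real key-frame detection, the frames are laid out in a simple grid, and the names `pick_key_frames` and `grid_layout` are hypothetical. Decoding the video, pasting the pixels, and the actual GPT-4V request are left out.

```python
import math


def pick_key_frames(total_frames: int, count: int) -> list[int]:
    """Pick up to `count` evenly spaced frame indices.

    Assumption: even spacing is a stand-in for real key-frame detection.
    """
    if total_frames <= 0 or count <= 0:
        return []
    count = min(count, total_frames)
    step = total_frames / count
    # Take the frame in the middle of each equal-length segment.
    return [int(step * i + step / 2) for i in range(count)]


def grid_layout(n_frames: int, frame_w: int, frame_h: int, columns: int):
    """Return the sheet size and each frame's top-left paste position.

    Frames are placed left to right, top to bottom, so the reading
    order of the stitched image matches the order of the video.
    """
    if columns <= 0:
        raise ValueError("columns must be positive")
    rows = math.ceil(n_frames / columns)
    positions = [
        ((i % columns) * frame_w, (i // columns) * frame_h)
        for i in range(n_frames)
    ]
    return (columns * frame_w, rows * frame_h), positions


if __name__ == "__main__":
    indices = pick_key_frames(total_frames=100, count=4)
    sheet_size, positions = grid_layout(len(indices), 320, 180, columns=2)
    for frame_index, (x, y) in zip(indices, positions):
        print(f"frame {frame_index} goes at ({x}, {y}) on a {sheet_size} sheet")
```

With the positions in hand, an image library can paste each decoded frame onto one blank sheet, and that single image goes to the vision model along with the list of steps to compare.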