	by rdedev 1301 days ago
	Each frame would have to be divided into a long sequence of patches. At least that's how transformer-based image models work. Then you have to account for the audio data in the same way. It just blows up the compute required.
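
	To put rough numbers on that, here is a back-of-the-envelope sketch. The frame size, patch size, frame rate, and clip length are all assumed for illustration and are not from the comment; it also assumes plain self-attention over every token at once, whose cost grows with the square of the sequence length. Audio tokens are left out and would only add to the total.

```python
# Rough token-count arithmetic for a patch-based transformer.
# All concrete numbers here are illustrative assumptions.

def tokens_per_frame(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patch tokens in one frame."""
    return (height // patch) * (width // patch)

def video_tokens(height: int, width: int, patch: int, fps: int, seconds: int) -> int:
    """Total patch tokens if every frame of the clip is tokenized."""
    return tokens_per_frame(height, width, patch) * fps * seconds

def attention_pairs(n_tokens: int) -> int:
    """Token-to-token pairs scored by full self-attention (quadratic in length)."""
    return n_tokens * n_tokens

frame = tokens_per_frame(224, 224, 16)                  # 14 * 14 = 196
clip = video_tokens(224, 224, 16, fps=30, seconds=10)   # 196 * 300 = 58800

print(frame, attention_pairs(frame))  # 196 38416
print(clip, attention_pairs(clip))    # 58800 3457440000
```

	Under those assumptions, a 10-second clip has 300 times as many tokens as a single frame, but full attention has to score 90,000 times as many pairs.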