by rdedev 1254 days ago
Each video frame would have to be split into a sequence of many patches. At least that's how transformer-based image models work. Then you have to account for the audio data in the same way. It just blows up the compute required, because the resulting sequence gets very long.
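To put rough numbers on that, here is a back-of-the-envelope sketch. The figures are assumptions for illustration only and not from any specific model: 224x224 frames cut into 16x16 patches (ViT-style), 30 fps, a 10 second clip, and a made-up audio token rate. The function names are hypothetical. The last helper assumes plain full self-attention, where every token is compared with every other token, so cost grows with the square of the sequence length.

```python
def tokens_per_frame(height, width, patch):
    # Non-overlapping square patches; assumes height and width divide evenly by patch.
    return (height // patch) * (width // patch)


def clip_tokens(height, width, patch, fps, seconds, audio_tokens_per_second=0):
    # Total sequence length if every frame's patches (plus audio tokens) go into one sequence.
    video = tokens_per_frame(height, width, patch) * fps * seconds
    audio = audio_tokens_per_second * seconds  # audio rate is an illustrative assumption
    return video + audio


def attention_pairs(n_tokens):
    # Full self-attention scores every token against every token.
    return n_tokens * n_tokens


if __name__ == "__main__":
    n = clip_tokens(224, 224, 16, fps=30, seconds=10, audio_tokens_per_second=50)
    print("tokens per frame:", tokens_per_frame(224, 224, 16))
    print("tokens in clip:", n)
    print("attention pairs per layer per head:", attention_pairs(n))
```

Under those assumptions a single frame is 196 tokens, but ten seconds of video is already tens of thousands of tokens, and the pairwise attention count lands in the billions per layer and head. Real video models usually cut this down with tricks like temporal subsampling or factorized attention, so treat this as the naive upper end.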