Hacker News
by schopra909 142 days ago
That all being said, you can just delete the T5 from memory after encoding the text to save on memory.
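A rough sketch of that pattern, in case it helps: encode once, keep only the embedding, then drop the encoder. `StandInTextEncoder` is a hypothetical placeholder for the real T5 (it just holds a big buffer), so this only shows the lifetime management, not the actual model calls.

```python
import gc
import weakref


class StandInTextEncoder:
    """Hypothetical stand-in for a T5 text encoder (NOT the real model)."""

    def __init__(self, size_bytes: int) -> None:
        # Large buffer playing the role of the encoder weights.
        self.weights = bytearray(size_bytes)

    def encode(self, prompt: str) -> list[float]:
        # Returns a small result that holds no reference back to the encoder.
        return [float(b) for b in prompt.encode("utf-8")[:16]]


encoder = StandInTextEncoder(10_000_000)
encoder_ref = weakref.ref(encoder)  # lets us check that the encoder is really gone

# Encode the prompt once and keep only the embedding.
text_embedding = encoder.encode("a cat surfing a wave")

# Drop the last reference to the encoder so its memory can be reclaimed.
del encoder
gc.collect()

encoder_freed = encoder_ref() is None
print("encoder freed:", encoder_freed, "| embedding length:", len(text_embedding))
```

With a real PyTorch model on a GPU you would probably also call `torch.cuda.empty_cache()` after the `del`, since the caching allocator otherwise holds on to the freed blocks.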

The 2B parameters will take up about 4 GB of memory (at 2 bytes per parameter), but activations will take a lot more given the size of the context windows for video.
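The weight figure is just parameter count times bytes per parameter. A quick sketch of that arithmetic, assuming 16-bit weights for the 4 GB number (the fp32 line is only there for comparison, and `param_memory_gb` is a made-up helper name):

```python
def param_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory for the weights alone, in GB (10**9 bytes). Ignores activations."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 2e9  # the 2B figure from the comment

weights_16bit = param_memory_gb(NUM_PARAMS, 2)  # fp16 / bf16 (assumed)
weights_fp32 = param_memory_gb(NUM_PARAMS, 4)   # for comparison only

print(f"16-bit weights: {weights_16bit:.1f} GB")
print(f"fp32 weights:   {weights_fp32:.1f} GB")
```

This covers weights only; activation memory depends on sequence length and the attention implementation, so it isn't estimated here.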

A 720p, 5-second video is roughly 100K tokens of context.
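For a sense of where a number like that can come from, here is a back-of-the-envelope sketch. Every compression factor below is an assumption for illustration (24 fps, a video VAE with 8x spatial and 4x temporal downsampling, 2x2 patchification); none of them are stated in the thread, and the actual model may use different ones.

```python
def video_tokens(width: int, height: int, seconds: int, fps: int,
                 spatial_down: int, temporal_down: int, patch: int) -> int:
    """Estimate transformer tokens for a clip under assumed compression factors."""
    latent_w = width // spatial_down
    latent_h = height // spatial_down
    latent_frames = (seconds * fps) // temporal_down
    tokens_per_frame = (latent_w // patch) * (latent_h // patch)
    return latent_frames * tokens_per_frame


# All values other than 720p and 5 seconds are assumed, not from the source.
tokens = video_tokens(
    width=1280, height=720, seconds=5,
    fps=24,            # assumed frame rate
    spatial_down=8,    # assumed VAE spatial compression
    temporal_down=4,   # assumed VAE temporal compression
    patch=2,           # assumed patch size
)
print(tokens)  # 108000 under these assumptions
```

Change the frame rate or compression factors and the count moves a lot, which is why "roughly 100K" is about as precise as this gets.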