|
|
|
|
|
by pavelstoev
155 days ago
|
|
If we think of GenAI models as a compression implementation. Generally, text compresses extremely well. Images and video do not. Yet state-of-the-art text-to-image and text-to-video models are often much smaller (in parameter count) than large language models like Llama-3. Maybe vision models are small because we’re not actually compressing very much of the visual world. The training data covers a narrow, human-biased manifold of common scenes, objects, and styles. The combinatorial space of visual reality remains largely unexplored. I am looking towards what else is out there outside of the human-biased manifold. |
|
1: Although it looks like the current Hutter competition leader is closer to 9:1, which I didn't realize. Pretty awesome by historical standards.