| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by pavelstoev 155 days ago
	If we think of GenAI models as a compression implementation. Generally, text compresses extremely well. Images and video do not. Yet state-of-the-art text-to-image and text-to-video models are often much smaller (in parameter count) than large language models like Llama-3. Maybe vision models are small because we’re not actually compressing very much of the visual world. The training data covers a narrow, human-biased manifold of common scenes, objects, and styles. The combinatorial space of visual reality remains largely unexplored. I am looking towards what else is out there outside of the human-biased manifold.

3 comments

CamperBob2 155 days ago

Images and video compress vastly better than text. You're lucky to get 4:1 to 6:1 compression of text (1), while the best perceptual codecs for static images are typically visually lossless at 10:1 and still look great at 20:1 or higher. Video compression is much better still due to temporal coherence.

1: Although it looks like the current Hutter competition leader is closer to 9:1, which I didn't realize. Pretty awesome by historical standards.

link

murderfs 155 days ago

> Generally, text compresses extremely well. Images and video do not.

Is that actually true? I'm not sure it's fair to compare lossless compression ratios of text (abstract, noiseless) to images and video that innately have random sampling noise. If you look at humanly indistinguishable compression, I'd expect that you'd see far better compression ratios for lossy image and video compression than lossless text.

link

regularfry 154 days ago

The comparison makes sense in what I am charitably assuming is the case the GP is referring to: we know how to build a tight embedding space from a text corpus, and get out outputs from it tolerably similar to the inputs for the purposes they're put to. That is lossy compression, just not in the sense anyone talking about conventional lossless text compression algorithms would use the words. I'm not sure we can say the same of image embeddings.

link

stkdump 154 days ago

I find it likely that we are still missing a few major efficiency tricks with LLMs. But I would also not underestimate the amount of implicit knowledge and skill an LLM is expected to carry on a meta level.

link