Hacker News
by wangg 328 days ago
Wouldn’t Wikipedia compress a lot more than LLMs? Are these uncompressed sizes?
2 comments

The downloads are (presumably) already compressed.

And there are strong ties between LLMs and compression. LLMs work by predicting the next token. The best compression algorithms work by predicting the next symbol and then encoding the actual symbol so that it costs fewer bits the more probable the model thought it was. So in a sense, an LLM trained on Wikipedia is kind of a compressed version of Wikipedia.
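
To make the prediction/compression link concrete, here is a rough sketch (not how any particular compressor or LLM is implemented). It uses a toy adaptive character-frequency model and sums `-log2(p)` over the symbols that actually occur, which is roughly what an ideal arithmetic coder would spend. The function name, the 256-symbol alphabet, and the sample text are all assumptions for illustration.

```python
import math
from collections import Counter


def ideal_code_length_bits(text: str, alphabet_size: int = 256) -> float:
    """Bits an ideal entropy coder would need under an adaptive order-0 model.

    Assumption: every character comes from an alphabet of `alphabet_size`
    symbols (true for the ASCII sample below).
    """
    counts = Counter()
    seen = 0
    total_bits = 0.0
    for ch in text:
        # Laplace-smoothed probability the model gave to the symbol
        # that actually occurred, before seeing it.
        p = (counts[ch] + 1) / (seen + alphabet_size)
        # A better prediction (larger p) costs fewer bits.
        total_bits += -math.log2(p)
        # Update the model only after "encoding" the symbol, so a decoder
        # could rebuild the same model step by step.
        counts[ch] += 1
        seen += 1
    return total_bits


sample = "the quick brown fox jumps over the lazy dog " * 50
model_bits = ideal_code_length_bits(sample)
raw_bits = 8 * len(sample)
print(f"raw: {raw_bits} bits, model: {model_bits:.0f} bits")
```

Swap the frequency counter for a much stronger predictor (an LLM, say) and the same accounting applies: the better the predictions, the fewer bits per token.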

Yes, they're uncompressed. For reference, `enwiki-20250620-pages-articles-multistream.xml.bz2` is 25,176,364,573 bytes; you could get that lower with better compression. You can do partial reads from multistream bz2, though, which is handy.
Kiwix (what the author used) uses "zim" files, which are compressed. I don't know where the difference comes from, but Kiwix is a website image, which may include some things the raw Wikipedia dump doesn't.
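
On the partial reads from multistream bz2 mentioned above: a multistream file is a series of independent bz2 streams concatenated together, so you can seek to the start of one stream and decompress only that. A minimal sketch using Python's standard `bz2` module follows. The helper name `read_stream_at` and the toy in-memory data are assumptions; for the real dump the offsets would have to come from its companion index file, which I'm not parsing here.

```python
import bz2
import io


def read_stream_at(f, offset: int, chunk_size: int = 64 * 1024) -> bytes:
    """Decompress exactly one bz2 stream that starts at byte `offset`."""
    f.seek(offset)
    decomp = bz2.BZ2Decompressor()
    out = []
    # BZ2Decompressor sets .eof at the end of the first stream; any bytes
    # belonging to the following stream are left in .unused_data.
    while not decomp.eof:
        chunk = f.read(chunk_size)
        if not chunk:
            raise EOFError("file ended before the bz2 stream did")
        out.append(decomp.decompress(chunk))
    return b"".join(out)


# Toy stand-in for a multistream dump: three separately compressed parts.
parts = [
    b"<page>A</page>\n" * 100,
    b"<page>B</page>\n" * 100,
    b"<page>C</page>\n" * 100,
]
blob = b""
offsets = []
for part in parts:
    offsets.append(len(blob))  # where this stream begins in the file
    blob += bz2.compress(part)

# Read only the second stream, without touching the first.
second = read_stream_at(io.BytesIO(blob), offsets[1])
print(second[:15])
```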

And 57 GB to 25 GB would be pretty bad compression. You can expect a compression ratio of at least 3 on natural English text.
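
Quick back-of-the-envelope check of that claim, assuming "GB" means 10**9 bytes and taking the 57 GB figure and the bz2 size quoted above at face value:

```python
# Assumption: "GB" means 10**9 bytes; both sizes are the ones quoted in the thread.
claimed_uncompressed = 57 * 10**9
bz2_dump = 25_176_364_573  # enwiki multistream .bz2 size quoted above

# Ratio implied if 57 GB really were the uncompressed size.
implied_ratio = claimed_uncompressed / bz2_dump

# What 57 GB of text would shrink to at a 3:1 ratio.
size_at_ratio_3 = claimed_uncompressed / 3

print(f"implied ratio: {implied_ratio:.2f}")
print(f"57 GB at 3:1: {size_at_ratio_3 / 10**9:.0f} GB")
```

The implied ratio comes out a bit under 2.3, below the 3:1 you'd expect for English text, which is why 57 GB is unlikely to be the uncompressed size of the same data.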