by b112 348 days ago
Not to mention, Language Modeling is Compression https://arxiv.org/pdf/2309.10668

So text-only Wikipedia at 24 GB would easily hit 8 GB with many standard forms of compression, I'd think. If not better. And it would be 100% accurate, with full text and data. Far more usable.

It's so easy for people to not realise how massive 8 GB really is in terms of text. Especially if you use ASCII instead of UTF.
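For anyone who wants to check the encoding point, here is a rough sketch that measures one string under a few encodings. The sample sentence and the `encoded_sizes` helper are made up for illustration. It assumes pure-ASCII input, where UTF-8 should come out the same size as ASCII and only the wider UTF-16/UTF-32 forms cost more.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the byte length of `text` under several encodings."""
    sizes = {}
    # The "-le" variants are used so no byte-order mark is added to the count.
    for name in ("ascii", "utf-8", "utf-16-le", "utf-32-le"):
        sizes[name] = len(text.encode(name))
    return sizes


# Hypothetical sample; pure ASCII, so the "ascii" codec will not raise.
sample = "Wikipedia is a free online encyclopedia."
sizes = encoded_sizes(sample)

for name, n in sizes.items():
    print(f"{name:10s} {n} bytes")
```

So the saving only shows up against the fixed-width or two-byte encodings, and real Wikipedia text has plenty of non-ASCII characters that plain ASCII cannot represent at all.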


The 24G is the compressed number.

They host a pretty decent article here: https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia

The relevant bit:

> As of 16 October 2024, the size of the current version including all articles compressed is about 24.05 GB without media.

Nice link, thanks.

Well, I'll fall back to my other position and say one is lossy, the other not.
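To make "not lossy" concrete, here is a minimal sketch of a lossless round trip using Python's standard `lzma` module. The sample text and the `roundtrip` helper are hypothetical, and the ratio on a tiny, repetitive string says nothing about what a real Wikipedia dump would achieve. The only point is that decompression gives back the exact bytes.

```python
import lzma


def roundtrip(data: bytes) -> tuple[int, int, bool]:
    """Compress and decompress `data`; report both sizes and whether it survived intact."""
    packed = lzma.compress(data)
    restored = lzma.decompress(packed)
    return len(data), len(packed), restored == data


# Hypothetical, highly repetitive sample, so it compresses far better than real prose.
sample = ("Wikipedia is a free online encyclopedia. " * 1000).encode("utf-8")
original_size, packed_size, identical = roundtrip(sample)

print(f"original:   {original_size} bytes")
print(f"compressed: {packed_size} bytes")
print(f"byte-for-byte identical after decompression: {identical}")
```

A model's weights offer no such guarantee: you can ask it about an article, but you cannot decompress it back into the article.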