by mjburgess 353 days ago

Anna's Archive full torrent is O(1PB), Project Gutenberg is O(1TB), many AI training torrents are reported in the O(50TB) range.

Extract just the plain text from that (+ social media, etc.), remove symbols outside of a 64-symbol alphabet (6 bits per symbol) and compress. "Feels" to me like around 100TB max for absolutely everything.
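A rough sketch of that filter-and-pack step, in case it helps make the estimate concrete. The specific alphabet (letters, digits, space, period) and the use of zlib as the compressor are my assumptions for illustration, not anything fixed by the argument.

```python
import string
import zlib

# Hypothetical 64-symbol alphabet: 52 letters + 10 digits + space + period.
# Any other choice of 64 symbols would work the same way.
ALPHABET = string.ascii_letters + string.digits + " ."


def filter_to_alphabet(text: str) -> str:
    """Drop every character that is not in the 64-symbol alphabet."""
    allowed = set(ALPHABET)
    return "".join(ch for ch in text if ch in allowed)


def packed_size_bytes(text: str) -> int:
    """Size if each symbol is stored in 6 bits, rounded up to whole bytes."""
    return (len(text) * 6 + 7) // 8


def compressed_size_bytes(text: str) -> int:
    """Size after general-purpose compression (zlib is a stand-in here)."""
    return len(zlib.compress(text.encode("ascii"), 9))


if __name__ == "__main__":
    sample = filter_to_alphabet("Hello, world! This is a test. " * 100)
    print(len(sample), packed_size_bytes(sample), compressed_size_bytes(sample))
```

Fixed 6-bit packing only buys a 25% saving over 8-bit text; most of the shrinkage comes from dropping non-text data and from the compression step.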

Either way, full-fat LLMs are operating at 1-10% of this scale, depending on how you want to estimate it.

If you run a more aggressive filter on that 100TB, e.g., for a more semantic dedup, there's a plausible argument for the "information" in available English texts being ~10TB -- then we're running close to 20% of that in LLMs.

If we take LLMs to just be that "semantic compression algorithm", and supposing the maximum useful size of an LLM is 2TB, then you could run the argument that everything "salient" ever written is <10TB.
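The percentages here are just ratios of the rough sizes quoted above. A quick sketch of the arithmetic, using the 2TB model size and the 1PB / 100TB / 10TB corpus guesses from this comment (all order-of-magnitude estimates, nothing measured):

```python
# All sizes in terabytes; these are the rough guesses from the comment.
LLM_SIZE_TB = 2.0

CORPUS_ESTIMATES_TB = {
    "Anna's Archive, raw (~1PB)": 1000.0,
    "plain text, 6-bit, compressed (~100TB)": 100.0,
    "after semantic dedup (~10TB)": 10.0,
}


def fraction_of_corpus(model_tb: float, corpus_tb: float) -> float:
    """Model size as a fraction of a corpus size estimate."""
    return model_tb / corpus_tb


if __name__ == "__main__":
    for label, corpus_tb in CORPUS_ESTIMATES_TB.items():
        pct = 100 * fraction_of_corpus(LLM_SIZE_TB, corpus_tb)
        print(f"{label}: {pct:.1f}%")
```

So a 2TB model sits at about 2% of the 100TB figure and about 20% of the 10TB one, which is where the numbers above come from.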

Taking LLMs to be running at close to 50% of "everything useful" rather than 1% would be an explanation of why training has capped out.

I think the issue is at least as much to do with what we're using LLMs for -- i.e., instruction fine-tuning requires some more general (proxy/quasi-) semantic structures in LLMs, and I think you only need O(1%) of "everything ever written" to capture these. So it wouldn't really matter how much more we added; instruction-following LLMs don't really need it.