by lizen_one 1392 days ago
I guess the text was extracted using two different methods. One results in 0.8TB of text and the other in 0.5TB.

1) I assume 1TB (not TiB) of uncompressed (?) text

2) I assume one character is one byte

3) I assume 5 (actually it seems to be 4.7 in English) characters per word

So 1TB / (1 byte per character) / (5 characters per word) = 1.0E12/5 = 2.0E11 = 0.2T = 200B (billion) words.
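
To make the estimate easy to redo with other inputs, here is a minimal sketch of the same arithmetic. The function name `estimate_words` and the constants are just illustrative; the numbers are the three assumptions above, not measured values.

```python
# Back-of-the-envelope word count; every input is an assumption from the list above.
TOTAL_BYTES = 1.0e12    # 1 TB (decimal, not TiB), assumed uncompressed
BYTES_PER_CHAR = 1      # assumes one character is one byte
CHARS_PER_WORD = 5      # rounded up; 4.7 seems closer for English


def estimate_words(total_bytes, bytes_per_char=BYTES_PER_CHAR,
                   chars_per_word=CHARS_PER_WORD):
    """Return a rough word count for a corpus of the given size in bytes."""
    return total_bytes / bytes_per_char / chars_per_word


if __name__ == "__main__":
    print(f"{estimate_words(TOTAL_BYTES):.1e} words")
```

Calling `estimate_words(1.0e12, chars_per_word=4.7)` gives a slightly higher figure, and passing 0.8e12 or 0.5e12 covers the two extraction sizes mentioned at the top.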

Your article mentioned that Chinchilla is trained on 1.4T tokens. So there is quite a gap: about 7x if you naively equate words with tokens.

The article also mentions different mysterious book data sets with 27B tokens, 560B tokens, or 390B tokens.

The latter datasets were made by Google. So you are still behind Google's massive book dataset even if you use what is probably the largest book dataset "available" to people or institutions outside of Google.

EDIT: I thought I made a mistake, but T stands for trillion or tera, which are both 1E12.