Hacker News
A Billion Words: Today's language modeling standard should be higher (googleresearch.blogspot.com)
9 points by vikram360 4425 days ago
1 comment

The GZ file is only 1.7GB. I imagine a densely packed model would almost fit on a machine with 8GB of RAM, which is surprising.

http://www.statmt.org/lm-benchmark/

Along similar lines, all of the English Wikipedia is < 10GB compressed, and about 45GB uncompressed: http://en.wikipedia.org/wiki/Wikipedia:Database_download#Eng.... That omits all the history (it is just the current pages), but it is still surprising to me how small it seems now.