by camel-cdr 348 days ago
> All digitized books ever written/encoded compress to a few TB.

I tried to estimate how much data this actually is:

    # Anna's Archive stats
    papers = 105714890
    books = 52670695
    
    # word count estimates
    avrg_words_per_paper = 10000
    avrg_words_per_book = 100000
    
    words = papers*avrg_words_per_paper + books*avrg_words_per_book
    
    # quick test on 27 million words from a few books
    sample_words = 27809550
    sample_bytes = 158824661
    sample_bytes_comp = 28839837 # using zpaq -m5
    
    bytes_per_word = sample_bytes/sample_words # raw bytes per word
    byte_comp_ratio = sample_bytes_comp/sample_bytes # compressed size / raw size
    word_comp_ratio = bytes_per_word*byte_comp_ratio # compressed bytes per word
    
    print("total:", words*bytes_per_word*1e-12, "TB") # total: ~36.12 TB
    print("compressed:", words*word_comp_ratio*1e-12, "TB") # compressed: ~6.56 TB

So uncompressed ~36 TB and compressed ~6.6 TB of data.

That fits on four 2TB micro SD cards, which at the roughly 250$ per card that SanDisk charges would cost about 1000$ in total.
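
For anyone who wants to redo this with their own sample, here is a minimal sketch of how the sample figures could be measured and how a compressed total turns into a card count. `sample_stats` and `cards_needed` are hypothetical names, the sample text is a toy placeholder, and `lzma` from the Python standard library is only a stand-in for `zpaq -m5`, so the ratio it reports will likely differ from the one quoted above.

```python
import lzma
import math


def sample_stats(text: str) -> tuple[int, int, int]:
    """Return (word count, UTF-8 byte count, compressed byte count) for a sample."""
    raw = text.encode("utf-8")
    # Assumption: lzma stands in for zpaq -m5, which is not in the stdlib.
    compressed = lzma.compress(raw, preset=9)
    # Words are counted by splitting on whitespace, a rough approximation.
    return len(text.split()), len(raw), len(compressed)


def cards_needed(compressed_tb: float, card_tb: float = 2.0) -> int:
    """Smallest number of cards of size card_tb that can hold compressed_tb."""
    return math.ceil(compressed_tb / card_tb)


# Toy sample; a real run would read a few full books instead.
sample = "the quick brown fox jumps over the lazy dog " * 1000
words, raw_bytes, comp_bytes = sample_stats(sample)
print("bytes per word:", raw_bytes / words)
print("compression ratio:", comp_bytes / raw_bytes)

for tb in (5.5, 6.6):
    print(tb, "TB ->", cards_needed(tb), "x 2TB cards")
```

Any compressed total above 4 TB and up to 6 TB needs three 2TB cards, and anything above 6 TB up to 8 TB needs a fourth, so the card count is sensitive to the word count assumptions.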