Hacker News new | ask | show | jobs
by mdeck_ 1488 days ago
> I think if you have the luxury of assuming every token is a dictionary word, you can do much better by simply encoding each word as its index in the dictionary.

Then you have to store the dictionary.

3 comments

It's the standard pool/index tradeoff : instead of storing an array of possibly-duplicate objects, you simply store all unique objects once in a pool and store the array as an array of references into that pool.

You win if the original array of duplicate objects was so long or so duplicated that replacing it with references into the pool is a worthwhile reduction. The objectSize/indexSize ratio also plays some role.

The entire English dictionary (being very generous on what a "word" is) is around 4MB - nothing nowadays.

Not to mention of course that your computer probably already has 50 copies of it somewhere if you really don't want to bundle it

The vast majority of text objects anybody generates won't exceed 4MB though. 4MB is 4 million ascii/utf8 characters, if we assume a typical word in English is 4 characters for simplicity, that's 1 million words without spaces. A quick Google search for "Novel that is 1 million words" yields the fact that this is twice the word count of Lord of the Ring and the word count of the first 3 A Song of Ice and Fire (Game of Thrones) books. Accounting for whitespace, longer common words, and inefficient encoding schemes would bring that overestimate down to, what, 150K words? a 300-page or 600-page book (depending on the spacing) according to Google, still massive.

I see it only working where there's massive pooling, like you say. An OS or a tool provides a known dictionary service and you call into it with a text string and get back an array of indices that you can then decode with another call. That amortizes the dictionary cost among all possible uses of the service, which is a lot if it's a standard well-known service. Another scenario is perhaps in a database\cloud storage\social media, any single text object might be small but the total text they store overall is massive.

Except that the corpus doesn't need to be stored as part of the compressed message, and can be considered part of the compression algorithm. It increases the size of the decoder by ~4MB, but doesn't increase the size of each message.
The autocorrect method also uses a dictionary, and then some.