> I think if you have the luxury of assuming every token is a dictionary word, you can do much better by simply encoding each word as its index in the dictionary.
It's the standard pool/index tradeoff : instead of storing an array of possibly-duplicate objects, you simply store all unique objects once in a pool and store the array as an array of references into that pool.
You win if the original array of duplicate objects was so long or so duplicated that replacing it with references into the pool is a worthwhile reduction. The objectSize/indexSize ratio also plays some role.
The vast majority of text objects anybody generates won't exceed 4MB though. 4MB is 4 million ascii/utf8 characters, if we assume a typical word in English is 4 characters for simplicity, that's 1 million words without spaces. A quick Google search for "Novel that is 1 million words" yields the fact that this is twice the word count of Lord of the Ring and the word count of the first 3 A Song of Ice and Fire (Game of Thrones) books. Accounting for whitespace, longer common words, and inefficient encoding schemes would bring that overestimate down to, what, 150K words? a 300-page or 600-page book (depending on the spacing) according to Google, still massive.
I see it only working where there's massive pooling, like you say. An OS or a tool provides a known dictionary service and you call into it with a text string and get back an array of indices that you can then decode with another call. That amortizes the dictionary cost among all possible uses of the service, which is a lot if it's a standard well-known service. Another scenario is perhaps in a database\cloud storage\social media, any single text object might be small but the total text they store overall is massive.
Except that the corpus doesn't need to be stored as part of the compressed message, and can be considered part of the compression algorithm. It increases the size of the decoder by ~4MB, but doesn't increase the size of each message.
You win if the original array of duplicate objects was so long or so duplicated that replacing it with references into the pool is a worthwhile reduction. The objectSize/indexSize ratio also plays some role.