Hacker News new | ask | show | jobs
by 123pie123 1896 days ago
I've always thought could you compress english text easily by getting the top X number of words/phrases and then using rarely used ascii letters/ combinations to represent those words. Then extending this concept for the remainding words using the most commonly used word groups (eg -ight)

I've been too way too busy (cough.. lazy) to test this idea

2 comments

That's what Re-Pair does. It repeatedly replaces the most frequent substring with a symbol until the input has no repeating substrings.

See https://en.wikipedia.org/wiki/Grammar-based_code

I had to do this by hand in the 1980s when I ran out of space in the ROM of the embedded controller I was working on for messages to be printed. There was no way to increase the size of ROM (2716 EPROM) at that stage in the development. The text was all 7 bit ASCII so I used the MSB as a flag to say that the remaining bits were a pointer into a table of frequently used fragments. This saved enough space to include the decompressor; it was recursive too so the fragments could also be compressed.
This is how LZW and Huffman compression work. LZW finds repeated symbol groups and encodes them as new symbols, and Huffman compression codes frequent symbols with fewer bits and infrequent symbols with more bits.