| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by 123pie123 1896 days ago
	I've always thought could you compress english text easily by getting the top X number of words/phrases and then using rarely used ascii letters/ combinations to represent those words. Then extending this concept for the remainding words using the most commonly used word groups (eg -ight) I've been too way too busy (cough.. lazy) to test this idea

2 comments

compressedgas 1896 days ago

That's what Re-Pair does. It repeatedly replaces the most frequent substring with a symbol until the input has no repeating substrings.

See https://en.wikipedia.org/wiki/Grammar-based_code

link

kwhitefoot 1896 days ago

I had to do this by hand in the 1980s when I ran out of space in the ROM of the embedded controller I was working on for messages to be printed. There was no way to increase the size of ROM (2716 EPROM) at that stage in the development. The text was all 7 bit ASCII so I used the MSB as a flag to say that the remaining bits were a pointer into a table of frequently used fragments. This saved enough space to include the decompressor; it was recursive too so the fragments could also be compressed.

link

web007 1896 days ago

This is how LZW and Huffman compression work. LZW finds repeated symbol groups and encodes them as new symbols, and Huffman compression codes frequent symbols with fewer bits and infrequent symbols with more bits.

link