|
|
|
|
|
by PaulHoule
516 days ago
|
|
It's easy to compress a sorted dictionary by turning something like a
ab
abc
to a
1b
2c
where the prefix is the number of characters shared with the last word (and might be coded as a byte as opposed to a digit like you're thinking there. This would be a simple form of compression to code up in BASIC unlike Huffman or most LZW variants which involve bit twiddling and maintaining tree data structures... Though I do remember writing SQ and USQ [1] in BASIC)[1] https://techtinkering.com/articles/compression-and-archiving... |
|
Meanwhile, the approach of the below 3 files (which would need only tiny adaptations to 1975 C) should have taken about 4 seconds at 200 kB/sec IO time (see below). This kind of technique was quite common in the 1960s database community also. So, it's a bit weird for the Bell labs guys to not have known about it, and it shows off the C/Unix system just as well. Here is an example implementation:
delta.c:
words.c: notIn.c: To back up my 4.2 seconds at 200 kB/sec, you can find any SOWPODS dictionary and it compresses to about 837 KiB with that delta.c. 837/200 = 4.185.If the rejoinder is "that SOWPODS stuff had/has licensing trouble" then no problemo - just use whatever in house dictionary they used and stemming / auto-suffix junk and use that to synthesize an exact dictionary. Then you can correct it as you go and et voila. In fact, if you wanted to be even faster than this 15..20X faster and then make accuracy-perf trade-offs, then you could probably shrink IO by generating the delta-encoded stream directly from the suffix rules.
I'd recommend staying exact, though. In this case, it seems a bad algo idea led them to be both inaccurate and slower and the writing was on the wall that hardware was getting faster. And honestly, my SOWPODS dictionary that seems 15X faster may well have better coverage than what they did which means an at-the-time apples to apples might have been 20..25x faster.
As a kind of data-compression / next optimizations side-note, the 267751 SOWPODS I compressed to 857065 bytes this way can only be Zstd -19'd down to about 588040 bytes. It's all mono-case and with simply a array-of-5-bits encoding, you could honestly get that 857065 down to (857065-267751)5+26775110 bits = 703010 bytes, less than 1.2X bigger than Zstd -19, but only 1.21X better than the more naive encoding above. So, you know, simple delta encoding works like gangbusters on a sorted spelling dictionary, has a very simple set membership algo (as above), and seems like it was at least an order of magnitude faster than what they actually did instead. I'd be a bit surprised if no one pointed this out at the time.