Hacker News new | ask | show | jobs
by cb321 492 days ago
> The compression was trivial: store a suffix preceded by one byte that contained the length of the prefix that the word shared with its predecessor in dictionary order

This is more or less what was speculated upon as compatible with his stemming ideas [1], but I (and partly McIlroy1982) doubt you ever really needed the fancy hashing-compression in the first place (though I'm sure it was fun to figure out!). I got better perf (3x faster on a 5500 word document with an 82,000 word "unabridged" dictionary that prefix-compresses to only 278K which would fit on a 360K floppy of the era) than v7spell on v7 with just a merge algorithm somewhat more finely tuned against that format. Of course, on a shared computer getting either the streaming disk access (without seeks spoiling the party) or all that RAM to yourself (without swapping spoiling the party) is a guess. And, yeah, probably the background Doug mentions is all in the combined-unix-git-history by now { and it would not have been as attractive an academic paper :-) }.

[1] https://news.ycombinator.com/item?id=42931145