|
|
|
|
|
by cb321
501 days ago
|
|
Actually, @probably_wrong was actually right above. It took a while, but I found Bentley 1986 Programming Pearls Column 13 describes Kernighan & Plaugher 1981 as describing the original v6 spell in more detail as indeed the smart merge technique: concat ..|translit A-Z a-z|translit ^a-z @n|sort|unique|common -2 wds
For the curious, in modern parlance that Unix v6 spell is: tr A-Z a-z | tr -c a-z \\n | sort -u | comm -2 -3 - someDict
So, A) mea culpa & this corrects the record and B) my measurements above suggest that @Paul_Houle's delta encoding idea alone would have given ~4X IO scan reduction on unabridged with no word stemming accuracy trade-offs. { Delta coding is, of course, combinable with word stemming in the same way as McIlroy's ideas since "notIn" is just a set operator. You just would have needed to sort -u after stemming.. maybe a stem.c to replace the two trs. } So, I still don't think you needed large Random Access Memory. |
|
For posterity, that ni5.c outperforming v7 spell (not counting IO) in 1979 v7 C is this:
It's actually not so bad of a C program, especially if you compare it to Doug McIlroy's A) much larger and B) approximate and C) very English-specific program. While I'm sure there is a little bit of time in pre-processing for v7spell that could be elided, I feel I've confirmed the idea (mentioned in McIlroy1982, actually, at the end of HASHING & before SUPERIMPOSED CODES) that an exact approach could have been fine.McIlroy 82 seemed aware of this, mentioning the Morris-Thompson way as [11], but I'm not sure he was aware of the pretty severe trade-off between more compacting data compression ideas and how fast they are to decode. As you can perhaps see from this program the prefix-delta fits very naturally into the actual data flow of the algorithm and even informs how to intelligently skip a lot of data comparison. It might be possible to do that even better - I didn't think too long about that.
Of course, at some scale dictionary and some scale inputs the loop-over-words will beat whole-document-at-once, but 5500 words (chosen to match a statement in McIlroy82) is not that big. This post alone is about 1250. And 270e3 words is actually a large unabridged dictionary, bigger in fact than the "100-200K words" McIlroy speculated upon at the time.
Since run time is clearly pretty directly proportional to dictionary size, it's fair to say a 150kw dictionary would have been about twice as fast as v7spell and exact and human-language adaptable (at the 5500 word scale or moving the break-even to a lower 2750-ish word document size). Had this idea actually been used at the time, I imagine there would have been pressure to shrink that 870kB encoded size to fit on a 360K floppy drive which would have been 43% the size (and the time, when rehoused on a Winchester drive, though a 2..3 floppy install would also not have been too crazy). Even in the modern age over at https://github.com/wolfgarbe/SymSpell it seems 82,765 words or 31% the size (and run time) are deemed acceptable. So, either triple the v7 spell speed@5500 or an 1800 word break-even seem feasible. (EDIT: And, of course, the preprocessor could always select which of the two is fastest, if you want, based on how many unique words there are)