| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by PaulHoule 516 days ago

It's easy to compress a sorted dictionary by turning something like

  a
  ab
  abc

  a
  1b
  2c

where the prefix is the number of characters shared with the last word (and might be coded as a byte as opposed to a digit like you're thinking there. This would be a simple form of compression to code up in BASIC unlike Huffman or most LZW variants which involve bit twiddling and maintaining tree data structures... Though I do remember writing SQ and USQ [1] in BASIC)

[1] https://techtinkering.com/articles/compression-and-archiving...

1 comments

cb321 503 days ago

In fact, I think this smart merge technique would have been faster (and exact, with no approximations!) than what McIlroy 1982 describes on 1975..1980 hardware in spite of not fitting in memory. The end of that article says it took about 65..77 seconds of CPU on a PDP 11/70 to check a 5500 word document.

Meanwhile, the approach of the below 3 files (which would need only tiny adaptations to 1975 C) should have taken about 4 seconds at 200 kB/sec IO time (see below). This kind of technique was quite common in the 1960s database community also. So, it's a bit weird for the Bell labs guys to not have known about it, and it shows off the C/Unix system just as well. Here is an example implementation:

delta.c:

    #include <stdio.h>  /* stdin->out: delta encode sorted lines */
    #include <string.h>
    struct { unsigned nPfx:4, nSfx:4; } __attribute__((packed)) nPS;
    
    int main(int ac, char **av) {
        char word[64], prev[64];
        int  nPrev = 0;
        while (fgets(&word[0], sizeof word, stdin)) {
            int n = strlen(word), nPfx;
            int m = n < nPrev ? n : nPrev;
            for (nPfx = 0; nPfx < m; nPfx++)
                if (prev[nPfx] != word[nPfx]) break;
            nPS.nPfx = nPfx;
            nPS.nSfx = n - nPfx - 1;
            fwrite(&nPS, 1, 1, stdout);
            fwrite(&word[nPfx], 1, n - nPfx - 1, stdout);
            nPrev = n - 1;
            memcpy(prev, word, n - 1);
        }
        return 0;
    }

words.c:

    #include <stdio.h> /* Emit all words in stdin, 1 to a line */
    #include <ctype.h> /*~tr -cs A-Za-z \\n|tr A-Z a-z|grep -vE .{16,} */
    
    int main(int ac, char **av) {
        char wd[15];
        int  n, c;
    #define PUT fwrite_unlocked(wd,1,n,stdout); putchar_unlocked('\n')
        while ((c = getchar_unlocked()) >= 0) { /* accum word chs */
            if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
                if (n < sizeof wd)          /* make smarter & .. */
                    wd[n++] = tolower(c);   /*..contraction aware. */
                else
                    fprintf(stderr, "too long: %15.15s...\n", wd);
            } else if (n > 0) { /* non-word c & have data */
                PUT; n = 0;     /* put & reset */
            }
        }
        if (n > 0) { PUT; }     /* put any final word */
        return 0;
    }

notIn.c:

    #include <stdio.h>  /* Emit stdin words not in /tmp/dict */
    #include <string.h> /* Both sources must be delta-encoded */
    struct { unsigned nPfx:4, nSfx:4; } __attribute__((packed)) nPS;
    
    int main(int ac, char **av) {
        FILE *fI = stdin, *fD = fopen("/tmp/dict", "r");
        char  wI[16], wD[16];       /* Word buffers; then lens */
        int   nI=0, nD=0, oI, oD;   /* Flag saying last read was Ok */
    #define GET(T) /* update [wno][ID] with its next word */ \
        if (o##T = fread(&nPS, 1, sizeof nPS, f##T)==sizeof nPS) { \
            n##T = nPS.nPfx + nPS.nSfx; \
            o##T = o##T && fread(&w##T[nPS.nPfx], 1, nPS.nSfx, f##T);}
    #define PUT { fwrite(wI, 1, nI, stdout); putchar('\n'); GET(I) }
        GET(I); GET(D); /* set theory "not in": ordered merge impl */
        while (oI && oD) {
            int c = memcmp(wI, wD, nI < nD ? nI : nD);
            if (c == 0) c = nI - nD;
            if (c < 0) PUT                   /* I<D: emit&advance I */
            else if (c == 0){GET(I);GET(D);} /* I==D: advance both */
            else GET(D);                     /* I>D: advance D */
        }
        while (oI) PUT                       /* flush tail out */
    } /* (echo a;cat $SOWPODS) | delta>/tmp/dict # DE UN-abridged dict
         words < Input | sort -u | delta | notIn # Extract,DE,Check */

To back up my 4.2 seconds at 200 kB/sec, you can find any SOWPODS dictionary and it compresses to about 837 KiB with that delta.c. 837/200 = 4.185.

If the rejoinder is "that SOWPODS stuff had/has licensing trouble" then no problemo - just use whatever in house dictionary they used and stemming / auto-suffix junk and use that to synthesize an exact dictionary. Then you can correct it as you go and et voila. In fact, if you wanted to be even faster than this 15..20X faster and then make accuracy-perf trade-offs, then you could probably shrink IO by generating the delta-encoded stream directly from the suffix rules.

I'd recommend staying exact, though. In this case, it seems a bad algo idea led them to be both inaccurate and slower and the writing was on the wall that hardware was getting faster. And honestly, my SOWPODS dictionary that seems 15X faster may well have better coverage than what they did which means an at-the-time apples to apples might have been 20..25x faster.

As a kind of data-compression / next optimizations side-note, the 267751 SOWPODS I compressed to 857065 bytes this way can only be Zstd -19'd down to about 588040 bytes. It's all mono-case and with simply a array-of-5-bits encoding, you could honestly get that 857065 down to (857065-267751)5+26775110 bits = 703010 bytes, less than 1.2X bigger than Zstd -19, but only 1.21X better than the more naive encoding above. So, you know, simple delta encoding works like gangbusters on a sorted spelling dictionary, has a very simple set membership algo (as above), and seems like it was at least an order of magnitude faster than what they actually did instead. I'd be a bit surprised if no one pointed this out at the time.

link

cb321 501 days ago

Actually, @probably_wrong was actually right above. It took a while, but I found Bentley 1986 Programming Pearls Column 13 describes Kernighan & Plaugher 1981 as describing the original v6 spell in more detail as indeed the smart merge technique:

    concat ..|translit A-Z a-z|translit ^a-z @n|sort|unique|common -2 wds

For the curious, in modern parlance that Unix v6 spell is:

    tr A-Z a-z | tr -c a-z \\n | sort -u | comm -2 -3 - someDict

So, A) mea culpa & this corrects the record and B) my measurements above suggest that @Paul_Houle's delta encoding idea alone would have given ~4X IO scan reduction on unabridged with no word stemming accuracy trade-offs. { Delta coding is, of course, combinable with word stemming in the same way as McIlroy's ideas since "notIn" is just a set operator. You just would have needed to sort -u after stemming.. maybe a stem.c to replace the two trs. } So, I still don't think you needed large Random Access Memory.

link

cb321 497 days ago

I was curious if "approximate was even faster than exact" at the time in 1975..78. So, I learned a bit how to get a v7 booting on a SIMH pdp11 simulator. I actually got an ordered merge against @Paul_Houle's delta-against-the-prefix encoded format mention to run faster than the v7 spell. Here is a little transcript:

    $ time zk
    Total: 500400000
    real        2.0 user        2.2 sys         0.1
    $ echo suw = 1st 5500 sorted-unique words of Tale Of Two Cities
    $ wc < suw
       1531   1531   11043
    $ time comm -23 - sowpodsAIO < suw >/n
    real     3:04.0 user     2:28.2 sys        35.4
    $ time ni0 sowpodsAIO.pd < suw.pd > /n
    real     2:01.0 user     1:48.5 sys        11.4
    $ time ni1 sowpodsAIO.pd < suw > /n
    real     1:41.0 user     1:29.1 sys        11.3
    $ time ni2 sowpodsAIO.pd < suw > /n
    real     1:26.0 user     1:14.1 sys        11.5
    $ time ni3 sowpodsAIO.pd < suw > /n
    real     1:04.0 user       53.2 sys        10.3
    $ time ni4 sowpodsAIO.pd < suw > /n
    real     1:04.0 user       53.3 sys        10.1
    $ time ni5 sowpodsAIO.pd < suw >/n
    real       45.0 user       34.4 sys        10.1
    $ time spell < suw > /n
    /bin/spell: /usr/dict/spellhist: cannot create
    real       46.0 user        1.4 sys         1.4
    $ wc answer.spell
         78     78     604 answer.spell
    $ wc answer.ni2
         35     35     225 answer.ni2

zk is Bob Supnik's classic PDP-11 performance test. `comm -23` is the slow part of the v6 Unix spell. `ni0.c` is the same as the above `notIn.c`. `ni1.c` specializes to have stdin be non-prefix-delta but the dictionary still prefix-delta and does the simple optimization of not recomparing already compared prefix bytes. `ni2.c` does the remaining simple (also not English-specific) optimization of skipping over long words part of an "increasing run" like "abbreviate abbreviated abbreviates abbreviating .." on the way to "abide", say. `ni3.c` ditches stdio buffering for the dictionary in favor of its own to play with the buffer size since v7 could only do BUFSIZ=512 bytes. `ni4.c` does the same for the stdin stream and also eliminates the need for a follow-on strlen (like the modern getline/getdelim). Finally, `ni5.c` manually inlines the character reading loop into the macro to get the next dictionary word. In real life, there would probably be 2-4 seconds of IO time added for all but the `comm` variant (which needs 2.7 MB IO not 870 kB).

For posterity, that ni5.c outperforming v7 spell (not counting IO) in 1979 v7 C is this:

    #include <stdio.h>  /* Emit stdin words not in /t/dict.pd */
    extern int errno;   /* ni dict.pd < sort-uWds */
    #ifndef Z
    #define Z 2048
    #endif
    char buf[Z]; char *b = buf; int n = 0;
    fillbuf(fd) int fd; {
        int m = read(fd, buf, Z);
        if (m > 0) { b = buf + 1; n = m - 1; return buf[0]; }
        return -1;
    }
    #define chGet(fd) (n-- > 0 ? *b++ : fillbuf(fd))
    char bufI[Z]; char *bI = bufI; int In = 0;
    filbufI() { /* symbols must be <= 7 chars long */
        int m = read(0, bufI, Z);
        if (m > 0) { bI = bufI + 1; In = m - 1; return bufI[0]; }
        return -1;
    }
    #define chGetI (In-- > 0 ? *bI++ : filbufI())
    readw(wI, max) char *wI; int max; {
        register int  i;
        register char c;
        for (i = 0; i < max; i++) {
            c = chGetI;
            if (c != '\n' && c != -1)
                wI[i] = c;
            else
                break;
        }
        return i;
    }
    main(ac, av) int ac; char **av; {
        int  fD = open(av[1], 0);
        char wI[64], wD[16], nPS;   /* Word buffers, lens, then.. */
        int nI=0, nD=0, oI=1, oD=1; /* ..a flag sez read was Ok. */
        int m, nS, nP, i;   /* Running num. matching prefix chars */
        if (!fD){fprintf(stderr, "%s: %d\n", av[1],errno);return 1;}
    #define DGET    /* Update [wno]D with its next word */ \
        if ((oD = oD && (nPS = chGet(fD)) != -1)) { \
            /* pcc needs & 255 here| , but runs ~8% slower */ \
            nP = ((unsigned)nPS)>> 4; nS = ((unsigned)nPS) & 15; \
            nD = nP + nS; \
            for (i = nP; i < nD; i++) wD[i] = chGet(fD); }
    #define IGET    /* Update [wno]I with its next word */ \
        m = 0;      /* New inp word => reset cmp to start */ \
        oI = oI && (nI = readw(wI, sizeof(wI)));
    #define PUT { fwrite(wI, 1, nI, stdout); putchar('\n'); IGET }
        IGET DGET   /* Set Theory "not in": ordered merge impl */
        while (oI && oD) {
            register int c = 0, lim = nI < nD ? nI : nD;
            if (nP < m) m = nP;
            for (/**/; c == 0 && m < lim; m++) c = wI[m] - wD[m];
            m--;
            if (c == 0)
                c = nI - nD;
            if (c < 0) PUT                  /* I<D: Emit&AdvanceI */
            else if (c == 0) { IGET DGET }  /* I==D: advance both */
            else {                          /* I>D: advance D to..*/
                if (nD < nI) {              /*..a maybe <= word. */
                    DGET    /* Loop & compare more */
                } else {    /* Cannot possibly be <= while nP > m */
                    do { DGET } while (oD && nP > m);
                } /* The above trickery elides like 95% memcmp */
            }
        }
        while (oI) PUT                      /* flush tail out */
        return 0;
    }

It's actually not so bad of a C program, especially if you compare it to Doug McIlroy's A) much larger and B) approximate and C) very English-specific program. While I'm sure there is a little bit of time in pre-processing for v7spell that could be elided, I feel I've confirmed the idea (mentioned in McIlroy1982, actually, at the end of HASHING & before SUPERIMPOSED CODES) that an exact approach could have been fine.

McIlroy 82 seemed aware of this, mentioning the Morris-Thompson way as [11], but I'm not sure he was aware of the pretty severe trade-off between more compacting data compression ideas and how fast they are to decode. As you can perhaps see from this program the prefix-delta fits very naturally into the actual data flow of the algorithm and even informs how to intelligently skip a lot of data comparison. It might be possible to do that even better - I didn't think too long about that.

Of course, at some scale dictionary and some scale inputs the loop-over-words will beat whole-document-at-once, but 5500 words (chosen to match a statement in McIlroy82) is not that big. This post alone is about 1250. And 270e3 words is actually a large unabridged dictionary, bigger in fact than the "100-200K words" McIlroy speculated upon at the time.

Since run time is clearly pretty directly proportional to dictionary size, it's fair to say a 150kw dictionary would have been about twice as fast as v7spell and exact and human-language adaptable (at the 5500 word scale or moving the break-even to a lower 2750-ish word document size). Had this idea actually been used at the time, I imagine there would have been pressure to shrink that 870kB encoded size to fit on a 360K floppy drive which would have been 43% the size (and the time, when rehoused on a Winchester drive, though a 2..3 floppy install would also not have been too crazy). Even in the modern age over at https://github.com/wolfgarbe/SymSpell it seems 82,765 words or 31% the size (and run time) are deemed acceptable. So, either triple the v7 spell speed@5500 or an 1800 word break-even seem feasible. (EDIT: And, of course, the preprocessor could always select which of the two is fastest, if you want, based on how many unique words there are)

link