

by kragen 724 days ago

you ask what i mean about programmer productivity. consider this python code from https://norvig.com/spell-correct.html:

    def edits1(word):
        "All edits that are one edit away from `word`."
        letters    = 'abcdefghijklmnopqrstuvwxyz'
        splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
        deletes    = [L + R[1:]               for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
        replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
        inserts    = [L + c + R               for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

in seven lines of code, it computes all the potentially incorrect words that can be made from a given correct word with a single edit. so, for example, for 'antidisestableshmentarianism', it returns a set of 1482 words such as 'antidisestauleshmentarianism', 'antidisestableshmentarianlism', 'antidiseitableshmentarianism', 'antidisestablesjhmentarianism', and 'antidiseptableshmentarianism', totaling 42194 bytes. how would you do this in uxntal?
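
to see where a figure like 1482 comes from, here is a rough count of the candidates `edits1` builds before `set()` removes duplicates. `raw_edit_counts` is a hypothetical helper, not part of norvig's program; it just counts the items each list comprehension produces for an n-letter word:

```python
LETTERS = 'abcdefghijklmnopqrstuvwxyz'

def raw_edit_counts(n, k=len(LETTERS)):
    "candidates edits1 builds for an n-letter word, before set() deduplicates."
    return {
        'deletes': n,                  # one per letter
        'transposes': max(n - 1, 0),   # one per adjacent pair
        'replaces': k * n,             # k letters at each of n positions
        'inserts': k * (n + 1),        # k letters at each of n + 1 gaps
    }

counts = raw_edit_counts(len('antidisestableshmentarianism'))
total = sum(counts.values())           # 54n + 25 for a 26-letter alphabet
print(counts, total)
```

for the 28-letter example that is 1537 raw candidates, so `set()` is discarding 55 of them: replacing a letter with itself reproduces the original word at every position, and inserting a letter next to an identical letter gives the same string from two different splits.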

here's another part of norvig's program. this part tabulates the case-smashed frequency of every word in its 6-megabyte training set (which presumably consists only of correctly spelled words):

    import re
    from collections import Counter

    def words(text): return re.findall(r'\w+', text.lower())

    WORDS = Counter(words(open('big.txt').read()))

this takes about 340 milliseconds on one core of my laptop here, which runs at about 6000 MIPS, so it would take about 34 seconds on a machine running at 60 MIPS, maybe a little longer on the apollo3. there are 32198 distinct words in the training set, totaling 244015 characters; the most common word ('the') occurs 79809 times, and the longest word ('disproportionately') is 18 characters. so plausibly you could represent this hash table without any compression in about 500k, though cpython requires about 70 megabytes. ram compression could plausibly get those 500k down to the 384k the apollo3 has without trying to swap to offchip flash
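
one way the 500k estimate could work out, as back-of-envelope arithmetic. the word and character counts are the ones quoted above; the layout (one byte per character, a length byte and a 3-byte count stored with each word, 3-byte offsets in an open-addressing table at about 80% load) is purely an assumption for illustration, not a description of any existing implementation:

```python
# figures quoted in the comment above
DISTINCT_WORDS = 32198
TOTAL_CHARS = 244015
MAX_COUNT = 79809

# assumed layout (hypothetical)
LENGTH_BYTES = 1   # longest word is 18 characters
COUNT_BYTES = 3    # 79809 needs 17 bits, so 2 bytes are not enough
SLOT_BYTES = 3     # offset of an entry within the entry area
SLOTS = 40000      # open addressing at roughly 80% load

assert MAX_COUNT < 256 ** COUNT_BYTES

entry_bytes = TOTAL_CHARS + DISTINCT_WORDS * (LENGTH_BYTES + COUNT_BYTES)
slot_bytes = SLOTS * SLOT_BYTES
total_bytes = entry_bytes + slot_bytes
print(entry_bytes, slot_bytes, total_bytes)  # 372807 120000 492807
```

under those assumptions the table lands a little under 500k, which is still more than the 384k of ram mentioned above.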

finding the best correction for a word requiring two corrections like 'slowlyyy' takes 70ms, so plausibly it would take 10 seconds or so on the apollo3. (you could maybe do this in the background in a text editor.) (if it were compiled to efficient code, it would probably be closer to 300 milliseconds on the apollo3, because cpython's interpretive overhead is a factor of about 40.) 'disproportionatelyyy' takes 370ms. here's the rest of the correction code:

    def P(word, N=sum(WORDS.values())): 
        "Probability of `word`."
        return WORDS[word] / N

    def correction(word): 
        "Most probable spelling correction for word."
        return max(candidates(word), key=P)

    def candidates(word): 
        "Generate possible spelling corrections for word."
        return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

    def known(words): 
        "The subset of `words` that appear in the dictionary of WORDS."
        return set(w for w in words if w in WORDS)

    def edits2(word): 
        "All edits that are two edits away from `word`."
        return (e2 for e1 in edits1(word) for e2 in edits1(e1))

note that this requires you to have two such `edits1` sets in memory at once, though you could plausibly avoid that problem by tolerating more duplicates (double letters provoke duplicates in deletes, transposes, and replaces)
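
a minimal sketch of that tradeoff, assuming you are willing to look up duplicate candidates in the dictionary more than once. `edits1_lazy` and `edits2_lazy` are hypothetical names, not part of norvig's program; written as generators, they never hold a whole candidate set in memory, at the cost of yielding the duplicates that `set()` would have removed:

```python
LETTERS = 'abcdefghijklmnopqrstuvwxyz'

def edits1_lazy(word):
    "yield every string one edit away from `word`, duplicates included."
    for i in range(len(word) + 1):
        L, R = word[:i], word[i:]
        if R:
            yield L + R[1:]                    # delete
        if len(R) > 1:
            yield L + R[1] + R[0] + R[2:]      # transpose
        for c in LETTERS:
            if R:
                yield L + c + R[1:]            # replace
            yield L + c + R                    # insert

def edits2_lazy(word):
    "yield every string two edits away, without storing either level."
    return (e2 for e1 in edits1_lazy(word) for e2 in edits1_lazy(e1))
```

the candidates come out in a different order than in the original (grouped by split position rather than by kind of edit), which doesn't matter once they are filtered through `known`.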

norvig doesn't tell us exactly how long it took him to write the code, but he did it in a single airplane flight, except for some minor bugs which took years to find. more importantly, though, it's very easy code to read, so you can easily understand how it works in order to modify it. and that's the most important factor for programming productivity

here are some things in this code that are more difficult to write and much more difficult to read in uxntal:

- managing more than 64k of data (uxn's memory addresses are 16 bits)

- dynamically allocating lists of things such as the (left, right) tuples in splits

- dynamic memory allocation in general

- string concatenation

- eliminating duplicates from a set of strings

- iterating over the words in a text file

- generating a sequence of outputs from a sequence of inputs with a filtering predicate and a transformation function [f(x, y) for x, y in xys if p(x, y)]

- generating a lazy flat sequence of outputs from a nested loop (return (z for y in f(x) for z in f(y)))

- hash tables

- incrementally eliminating duplicates from a sequence of candidates that turn out to be valid words (set(w for w in words if w in WORDS))

- counting the number of occurrences of each string in a lazy sequence of strings

- floating-point arithmetic (which would be fairly easy to eliminate in this case, but not in many other cases; this deficiency in uxn is especially galling since the apollo3 has fast hardware floating point)

- finding the highest-rated item of a lazy sequence of candidates according to some scoring function

and all of that is on top of the general readability tax imposed by postfix syntax, where even figuring out which arguments are being passed to which subroutine is a mental challenge and a frequent source of bugs
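
on the floating-point item in the list above, a sketch of why it is easy to eliminate in this particular program: `N` is the same for every candidate, so ranking by the raw integer count picks the same winner as ranking by `P`. the counts below are made up for illustration and `best_by_count` is a hypothetical name:

```python
from collections import Counter

# toy counts for illustration only; not taken from big.txt
WORDS = Counter({'the': 500, 'then': 40, 'thee': 7})

def P(word, N=sum(WORDS.values())):
    "probability of `word`, as in the original (needs floating-point division)."
    return WORDS[word] / N

def best_by_probability(candidates):
    return max(candidates, key=P)

def best_by_count(candidates):
    "same ranking with integers only: dividing every count by the same N preserves order."
    return max(candidates, key=lambda w: WORDS[w])
```

this trick only works because the scores are compared and never combined; a model that multiplied probabilities together would need fixed-point or logarithm tables instead.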

note that these are mostly not deficiencies you can really patch with a library. i didn't mention that the program uses regular expressions, for example, because you can certainly implement regular expressions in uxntal. the deficiencies listed above are things you probably need to address at the level of language semantics, or virtual machine semantics in the case of the address space problem. and they're not tightly tied to cpython being implemented grossly inefficiently; pypy implements all the necessary semantics, and common lisp and c++ have similar facilities in most cases, though their handling of lazy sequences is a weakness that is particularly important on hardware with limited memory like the apollo3

so that's what i mean when i say that uxn is designed to make easy things hard, rather than making hard things easy

you say:

> the pain point might be intentional nudges away from making things the designer doesn't like

the thing is, i don't really care whether rek and devine think that autocorrecting misspellings is a bad thing to do; i want the computer to be a means of expression for my ideas, indeed for everyone's ideas, not for the ideas of a singular designer. that's the apple walled-garden mindset, and it's anathema to me. and, though i could be wrong about this, i think rek and devine would probably agree


kazinator 724 days ago

TXR Lisp:

  (defun edits1 (word)
    (hash-list (build
                 (each ((i 0..(len word)) (j 1))
                   (let ((le [word 0..i]) (rj [word j..:]) (ri [word i..:]))
                     (add (join le rj))               ;; deletes
                     (each ((c "a".."z"))
                       (add (join le c rj))           ;; replacements
                       (add (join le c [word i] rj))) ;; prefixes + inserts
                     (or (empty le) (empty ri)        ;; transposes
                         (add (join [le 0..-1] [ri 0] [le -1] [ri 1..:])))))
                 (each ((c "a".."z"))
                   (add (join word c))))))            ;; suffixes


kragen 724 days ago

this doesn't look bad at all! considerably better than common lisp, in particular. but i think the flatter structure of the python improves readability, and the independence of the different clauses facilitates interactive incremental testing:

    >>> word = 'the'
    >>> letters    = 'abcdefghijklmnopqrstuvwxyz'
    >>> splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    >>> splits
    [('', 'the'), ('t', 'he'), ('th', 'e'), ('the', '')]
    >>> deletes    = [L + R[1:]               for L, R in splits if R]
    >>> deletes
    ['he', 'te', 'th']

but lisps are generally pretty good at that kind of thing, so i imagine you could formulate it slightly differently in txr lisp to support that kind of thing (i just don't know txr lisp)

as a semantic question, is this materializing the whole list (as the python does) or are the `add` calls inserting into the hash table as the loops run, thus eliminating duplicates?
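
to make the question concrete, here are the two behaviors sketched in python (not txr lisp). `dedup_after` and `dedup_during` are hypothetical names; each returns the resulting set along with the number of strings it had to store along the way:

```python
def dedup_after(candidates):
    "materialize every candidate first, then deduplicate (what the python edits1 does)."
    everything = list(candidates)      # duplicates are stored here
    return set(everything), len(everything)

def dedup_during(candidates):
    "insert into the set as the loop runs, so duplicates are never stored."
    seen = set()
    for c in candidates:
        seen.add(c)
    return seen, len(seen)

word = 'aa'
inserts = [word[:i] + c + word[i:] for i in range(len(word) + 1) for c in 'ab']
print(dedup_after(inserts))    # set of 4 strings, 6 stored
print(dedup_during(inserts))   # same set, 4 stored
```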


kazinator 724 days ago

I had a bug somewhere, so I selectively turned off some of the add expressions. They can be commented out with #; or by flipping add to list or identity to throw away the value.

The add is something which pairs with build. Lisp doesn't have "bag-like" lists. For those times when they are helpful, we can have procedural list building syntax. The build macro creates an environment that provides a number of operators, like add, which build up or otherwise operate on an implicit list. When the build form terminates, it returns the list. (Its sister buildn returns the value of the last form, like progn.)

In this function, I could just have used (push expr stack) because we don't care about the order; there would be no nreverse. That would be a better idea, actually.

We could also add the strings to a table directly, like (set [h (join ...)] t).

The hash table is built by the hash-list call. It associates the elements of the list with themselves, so if "a" occurs in the list, the key "a" is associated with value "a".
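
As a rough Python analogy for the behavior just described (an illustration only, not how TXR Lisp implements hash-list), each element of the list becomes a key that maps to itself, so duplicates collapse into a single entry. The name `hash_list_like` is hypothetical:

```python
def hash_list_like(items):
    "Map each element to itself; repeated elements collapse into one key."
    return {item: item for item in items}

table = hash_list_like(['a', 'b', 'a'])
print(table)  # {'a': 'a', 'b': 'b'}
```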


kragen 724 days ago

thanks! it sounds like a pretty effective system, although the form of incremental development you're describing is editing and rebuilding the program, more like c than the kind of repl flow i was talking about

the use of [] for indexing rather than clojure's list building (or as conventional superparentheses) is appealing

what does buildn do with the built list?


kazinator 723 days ago

In TXR, we can easily recall the entire function definition at the REPL, and resubmit it, without requiring any external IDE.

Furthermore, if you have the kind of workflow where you have individual REPL commands produce results that are used by subsequent commands, the TXR Lisp listener has good support for that. When you recall an input line from history, you can use Ctrl-X Enter to execute it rather than just Enter. When you use Ctrl-X Enter, it will keep the history position and move to the next line in history rather than return to the current context. So using Ctrl-X Enter multiple times, you can resubmit a sequence of historic lines in order.


kragen 721 days ago

that's pretty nice!

you might want to switch to the standard gnu readline keybinding for 'submit the current input line for execution and recall the next line in history', which is control-o. aside from the synergistic effect of being able to use the same keybinding in txr, bash, python, etc., it's a command which you frequently want to use several times in a row, and binding such a command to a sequence of two keystrokes makes it disproportionately more clumsy. you may have noticed recent versions of emacs permit you to run a keyboard macro repeatedly with c-x e e e, and it's a huge usability improvement


kazinator 724 days ago

What buildn will do with the list is simply lose it. The last form can extract it and do something with it.

You might use it when the goal isn't to build a list which escapes. The construct supports queuing semantics (insert at one end, take from the other), so you can use buildn to express a breadth-first traversal that doesn't return anything, or returns something other than the list:

  (defun bf-map (tree visit-fn)
    (buildn
      (add tree)
      (whilet ((item (del)))  ;; (del) from front
        (if (atom item)
          [visit-fn item]
          (each ((el item))
            (add el))))))     ;; (add) to back
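
For readers following along in Python, a comparable breadth-first traversal can be sketched with collections.deque standing in for the implicit list. This is an analogy under stated assumptions, not a translation: nested Python lists and tuples play the role of the tree, and the loop stops when the queue is empty rather than when (del) yields nil.

```python
from collections import deque

def bf_map(tree, visit_fn):
    "Visit the leaves of a nested list/tuple in breadth-first order."
    queue = deque([tree])
    while queue:
        item = queue.popleft()             # take from the front
        if isinstance(item, (list, tuple)):
            queue.extend(item)             # add children to the back
        else:
            visit_fn(item)

visited = []
bf_map([1, [2, [3, 4]], 5], visited.append)
print(visited)  # [1, 5, 2, 3, 4]
```

As with buildn, the queue itself is discarded when the function returns; only the side effects of the visit function remain.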
