by bjoli 2080 days ago
Now, I am a grumpy old fart, but I would suggest making a one-pass recursive descent lexer+reader. That should be trivial for the MAL lisp. Writing an OCaml recursive descent parser should be amazingly straightforward, especially since you can just tail-call the different states.
1 comments

While re-using an existing regexp like

    [\s,]*(~@|[\[\]{}()'`~^@]|"(?:\\.|[^\\"])*"?|;.*|[^\s\[\]{}('"`,;)]*)
is easier than writing a tokenizer manually (keeping track of offsets and looping and stuff), writing that regexp is definitely harder than writing the tokenizer like this:

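    def isspace(c):
        return c.isspace() or c == ','    # the regexp's [\s,]* skips commas too

    def isspecial(c):
        return c in "[]{}()'`~^@"

    def isquote(c):
        return c == '"'

    def tokenize(s):
        curr, end = 0, len(s)
        while True:
            while curr < end and isspace(s[curr]):
                curr += 1

            if curr >= end:
                break

            if s[curr:curr + 2] == "~@":
                yield s[curr:curr + 2]
                curr += 2

            elif isspecial(s[curr]):        # isspecial(c) matches c against []{}()'`~^@
                yield s[curr]
                curr += 1

            elif isquote(s[curr]):          # isquote(c) matches c against "
                start = curr
                curr += 1

                # check this condition out: you can totally support several quotes,
                # and accurately match the closing and opening ones. Imagine doing it
                # with a regexp: either duplicate it, or use some back-referencing magic

                # a backslash escapes whatever follows it, so skip both characters;
                # only looking at s[curr-1] would misread a string that ends in \\
                while curr < end and s[curr] != s[start]:
                    curr += 2 if s[curr] == '\\' else 1
                curr = min(curr + 1, end)   # include the closing quote, if there is one
                yield s[start:curr]

            elif s[curr] == ';':
                # a comment runs to the end of the line, like ;.* in the regexp
                eol = s.find('\n', curr)
                eol = end if eol < 0 else eol
                yield s[curr:eol]
                curr = eol

            else:
                start = curr
                while curr < end and not (isspace(s[curr]) or isspecial(s[curr])
                                          or isquote(s[curr]) or s[curr] == ';'):
                    curr += 1
                yield s[start:curr]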
Yeah, it's more verbose and somewhat repetitive, but on the other hand, it's way more readable, and debuggable too: with regexps, it's always a mystery which part of it exactly didn't match what you wanted or captured something you didn't want to match. Here, the loop invariants and preconditions are almost immediately obvious.
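For comparison, here's roughly what the regexp route looks like in Python. This is only a sketch: the pattern is the one quoted above, copied verbatim, while `TOKEN_RE` and `tokenize_re` are names I made up for illustration.

```python
import re

# The pattern quoted above, verbatim. Group 1 is the token; the leading
# [\s,]* skips whitespace and commas.
TOKEN_RE = re.compile(
    r"""[\s,]*(~@|[\[\]{}()'`~^@]|"(?:\\.|[^\\"])*"?|;.*|[^\s\[\]{}('"`,;)]*)"""
)

def tokenize_re(s):
    # findall returns group 1 of every match. The pattern can also match
    # the empty string (it does so at the end of the input), so empty
    # tokens are dropped.
    return [tok for tok in TOKEN_RE.findall(s) if tok]

print(tokenize_re('(+ 1, "a \\" b") ; done'))
```

Short, sure, but when it hands back the wrong list there's nothing to step through: you can't tell which alternative fired without pulling the pattern apart by hand.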
Not only that: having a generator produce the tokens means you can do it in one pass, while writing code that has the clarity of two passes.
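To make the one-pass point concrete, here's a minimal sketch of a reader pulling tokens from a generator on demand. Everything in it is an assumption for illustration: `tokenize` is a deliberately crude stand-in that only knows parens and atoms, and `read_atom`, `read_form` and `read_str` are hypothetical names, not anything from MAL itself.

```python
def tokenize(s):
    # Crude stand-in tokenizer: yields parens and whitespace-separated
    # atoms one at a time. No strings, comments or quoting.
    curr, end = 0, len(s)
    while curr < end:
        if s[curr].isspace():
            curr += 1
        elif s[curr] in "()":
            yield s[curr]
            curr += 1
        else:
            start = curr
            while curr < end and not (s[curr].isspace() or s[curr] in "()"):
                curr += 1
            yield s[start:curr]

def read_atom(tok):
    # Integers become ints, everything else stays a symbol name.
    try:
        return int(tok)
    except ValueError:
        return tok

def read_form(tok, tokens):
    # tok is the token already pulled; tokens is the same generator, so
    # nested calls keep consuming from where the caller left off.
    if tok == "(":
        items = []
        for tok in tokens:
            if tok == ")":
                return items
            items.append(read_form(tok, tokens))
        raise SyntaxError("expected ')', got end of input")
    if tok == ")":
        raise SyntaxError("unexpected ')'")
    return read_atom(tok)

def read_str(s):
    tokens = tokenize(s)
    return read_form(next(tokens), tokens)

print(read_str("(+ 1 (* 2 3))"))
```

The reader reads like a second pass over a finished token list, but the tokenizer only ever runs as far as the reader has asked it to, so the input is walked once and no intermediate list is built.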