While re-using an existing regexp like

    [\s,]*(~@|[\[\]{}()'`~^@]|"(?:\\.|[^\\"])*"?|;.*|[^\s\[\]{}('"`,;)]*)

is easier than writing a tokenizer manually (keeping track of offsets and looping and stuff), writing that regexp in the first place is definitely harder than writing the tokenizer like this:

    curr, end = 0, len(s)
    while True:
        while curr < end and isspace(s[curr]):
            curr += 1

        if curr >= end:
            break
            
        if s[curr:curr + 2] == "~@":
            yield s[curr:curr + 2]
            curr += 2

        elif isspecial(s[curr]):        # isspecial(c) matches c against []{}()'`~^@
            yield s[curr]
            curr += 1

        elif isquote(s[curr]):          # isquote(c) matches c against "
            start = curr
            curr += 1

            # check this condition out: you can totally support several quotes,
            # and accurately match the closing and opening ones. Imagine doing it
            # with a regexp: either duplicate it, or use some back-referencing magic

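            # step over escaped characters in pairs, so an escaped backslash
            # right before the closing quote (as in "a\\") doesn't hide it
            while curr < end and s[curr] != s[start]:
                curr += 2 if s[curr] == '\\' else 1
            curr += 1        # we want to include the closing quote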
            yield s[start:curr]

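        elif s[curr] == ';':            # comment: runs to the end of the line, like ;.* in the regexp
            start = curr
            while curr < end and s[curr] != '\n':
                curr += 1
            yield s[start:curr]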

        else:
            start = curr
            while curr < end and not (isspace(s[curr]) or isspecial(s[curr]) or isquote(s[curr]) or s[curr] == ';'):
                curr += 1
            yield s[start:curr]

Yeah, it's more verbose and somewhat repetitive, but on the other hand, it's way more readable, and debuggable too: with regexps, it's always a mystery which part of it exactly didn't match what you wanted or captured something you didn't want to match. Here, the loop invariants and preconditions are almost immediately obvious.
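For anyone who wants to actually run this and compare the two approaches, here is a minimal self-contained sketch. The function names `tokenize` and `regex_tokenize` are my own, and the bodies of `isspace`, `isspecial` and `isquote` are assumptions based on the comments in the snippet (in particular, I'm assuming commas count as whitespace, as `[\s,]*` in the regexp suggests). In this version, strings skip escaped characters in pairs, and comments and symbols end where the regexp would end them. It's one way to fill in the blanks, not the only one.

```python
import re

# The regexp quoted above, compiled once.
TOKEN_RE = re.compile(
    r"""[\s,]*(~@|[\[\]{}()'`~^@]|"(?:\\.|[^\\"])*"?|;.*|[^\s\[\]{}('"`,;)]*)"""
)


def regex_tokenize(s):
    """Regexp-based tokenizer; drops the empty match produced at end of input."""
    return [tok for tok in TOKEN_RE.findall(s) if tok]


# Helper predicates: assumed definitions, not spelled out in the snippet.
def isspace(c):
    return c.isspace() or c == ","   # assumption: commas are whitespace


def isspecial(c):
    return c in "[]{}()'`~^@"


def isquote(c):
    return c == '"'


def tokenize(s):
    """Hand-written tokenizer: the loop above wrapped in a generator."""
    curr, end = 0, len(s)
    while True:
        while curr < end and isspace(s[curr]):
            curr += 1
        if curr >= end:
            break

        if s[curr:curr + 2] == "~@":
            yield s[curr:curr + 2]
            curr += 2
        elif isspecial(s[curr]):
            yield s[curr]
            curr += 1
        elif isquote(s[curr]):
            start = curr
            curr += 1
            # Skip escaped characters in pairs until the matching quote.
            while curr < end and s[curr] != s[start]:
                curr += 2 if s[curr] == "\\" else 1
            curr = min(curr + 1, end)   # include the closing quote, if any
            yield s[start:curr]
        elif s[curr] == ";":
            start = curr
            while curr < end and s[curr] != "\n":   # comment ends at newline
                curr += 1
            yield s[start:curr]
        else:
            start = curr
            while curr < end and not (isspace(s[curr]) or isspecial(s[curr])
                                      or isquote(s[curr]) or s[curr] == ";"):
                curr += 1
            yield s[start:curr]


if __name__ == "__main__":
    for sample in [r'(def! x [1, 2 "a\"b"]) ~@y ; note', r'"a\\" b']:
        print(list(tokenize(sample)))
        print(regex_tokenize(sample))
```

On these samples the two tokenizers produce the same tokens. They are not guaranteed to agree on every malformed input (an unterminated string ending in a backslash is one case where they differ), which is the sort of thing that is much easier to spot and step through in the loop than in the regexp.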