Now, I am a grumpy old fart, but I would suggest making a one-pass recursive descent lexer+reader. That should be trivial for the MAL Lisp. Writing an OCaml recursive descent parser should be amazingly straightforward, especially since you can just tail-call between the different states.
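To make "one-pass lexer+reader" concrete, here is a minimal sketch. It is in Python rather than OCaml, to match the other snippet in this comment, so the states are plain recursive calls instead of tail calls. The names (`read_str`, `read_form`, `read_seq` and friends) are my own for illustration, and it only covers lists, vectors, strings, integers and symbols, not the reader macros.

```python
CLOSERS = {"(": ")", "[": "]"}  # assumption: lists and vectors both become Python lists

def skip_ws(s, i):
    while i < len(s) and s[i].isspace():
        i += 1
    return i

def read_str(s):
    """Parse one form from s in a single pass; no token list is ever built."""
    form, _ = read_form(s, skip_ws(s, 0))
    return form

def read_form(s, i):
    """Return (form, index just past it). Dispatches on the first character."""
    if i >= len(s):
        raise SyntaxError("unexpected end of input")
    if s[i] in CLOSERS:
        return read_seq(s, i + 1, CLOSERS[s[i]])
    if s[i] == '"':
        return read_string(s, i)
    return read_atom(s, i)

def read_seq(s, i, closer):
    items = []
    while True:
        i = skip_ws(s, i)
        if i >= len(s):
            raise SyntaxError("expected " + closer)
        if s[i] == closer:
            return items, i + 1
        form, i = read_form(s, i)
        items.append(form)

def read_string(s, i):
    j = i + 1
    while j < len(s) and s[j] != '"':
        j += 2 if s[j] == "\\" else 1  # a backslash escapes the next character
    if j >= len(s):
        raise SyntaxError('expected closing "')
    return s[i:j + 1], j + 1  # raw text, quotes included

def read_atom(s, i):
    j = i
    while j < len(s) and not (s[j].isspace() or s[j] in '()[]"'):
        j += 1
    if j == i:
        raise SyntaxError("unexpected " + s[i])  # e.g. a stray closing bracket
    text = s[i:j]
    try:
        return int(text), j
    except ValueError:
        return text, j  # anything that is not an integer is treated as a symbol
```

Every function takes the current offset and returns the offset just past what it consumed, so there is no shared mutable state to keep in sync between the "lexer" and the "reader".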
Even if using a regexp is easier than writing a tokenizer manually (keeping track of offsets, looping and so on), writing that regexp is definitely harder than writing the tokenizer like this:
def isspace(c):
    return c.isspace()

def isspecial(c):
    # the single-character tokens
    return c in "[]{}()'`~^@"

def isquote(c):
    # only the double quote for now
    return c == '"'

def tokenize(s):
    curr, end = 0, len(s)
    while True:
        # skip whitespace between tokens
        while curr < end and isspace(s[curr]):
            curr += 1
        if curr >= end:
            break
        if s[curr:curr + 2] == "~@":
            yield s[curr:curr + 2]
            curr += 2
        elif isspecial(s[curr]):
            yield s[curr]
            curr += 1
        elif isquote(s[curr]):
            start = curr
            curr += 1
            # check this condition out: you can totally support several quotes,
            # and accurately match the closing and opening ones. Imagine doing it
            # with a regexp: either duplicate it, or use some back-referencing magic
            while curr < end and s[curr] != s[start]:
                # a backslash escapes the next character, so skip both
                # (this also gets a string ending in an escaped backslash right)
                curr += 2 if s[curr] == '\\' else 1
            curr = min(curr + 1, end)  # include the closing quote, if there is one
            yield s[start:curr]
        elif s[curr] == ';':
            # a comment runs to the end of s (s is assumed to be a single line)
            yield s[curr:]
            break
        else:
            start = curr
            while curr < end and not (isspace(s[curr]) or isspecial(s[curr]) or isquote(s[curr])):
                curr += 1
            yield s[start:curr]
Yeah, it's more verbose and somewhat repetitive, but on the other hand, it's way more readable, and debuggable too: with regexps, it's always a mystery which part of it exactly didn't match what you wanted or captured something you didn't want to match. Here, the loop invariants and preconditions are almost immediately obvious.
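For comparison, here is roughly what the "back-referencing magic" for the string branch could look like next to a hand-written scanner with its precondition and invariant spelled out. This is a sketch: `STRING_RE` and `scan_string` are names I made up, the regexp is one possible formulation rather than the one MAL uses, and both assume that a backslash escapes the following character.

```python
import re

# \1 refers back to whichever quote opened the string; (?!\1) keeps the body
# from swallowing the closing quote, and [^\\] leaves backslashes to \\.
STRING_RE = re.compile(r'''(["'])(?:\\.|(?!\1)[^\\])*\1''')

def scan_string(s, start):
    """Return the index just past the string literal that opens at s[start]."""
    assert s[start] in "\"'", "precondition: s[start] is an opening quote"
    curr, end = start + 1, len(s)
    while curr < end and s[curr] != s[start]:
        # invariant: s[start + 1:curr] holds no unescaped closing quote
        curr += 2 if s[curr] == "\\" else 1
    return min(curr + 1, end)

# Both approaches should agree on well-formed input.
for text in ['"a\\"b" rest', "'it\"s' x", '"a\\\\" b']:
    match = STRING_RE.match(text)
    assert match is not None
    assert match.group(0) == text[:scan_string(text, 0)]
```

When the scanner gets something wrong you can put a breakpoint or a print inside the loop and watch `curr` move. With the regexp, you get a match or you don't.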