by Joker_vD
2079 days ago
While re-using an existing regexp like [\s,]*(~@|[\[\]{}()'`~^@]|"(?:\\.|[^\\"])*"?|;.*|[^\s\[\]{}('"`,;)]*)
is easier than writing a tokenizer manually (keeping track of offsets, looping and so on), writing that regexp is definitely harder than writing the tokenizer like this:

  def isspace(c):    # assumed helper; the regexp also skips commas
      return c.isspace() or c == ','

  def isspecial(c):  # matches c against []{}()'`~^@
      return c in "[]{}()'`~^@"

  def isquote(c):    # matches c against "
      return c == '"'

  def tokenize(s):
      curr, end = 0, len(s)
      while True:
          while curr < end and isspace(s[curr]):
              curr += 1
          if curr >= end:
              break
          if s[curr:curr + 2] == "~@":
              yield s[curr:curr + 2]
              curr += 2
          elif isspecial(s[curr]):
              yield s[curr]
              curr += 1
          elif isquote(s[curr]):
              start = curr
              curr += 1
              # check this condition out: you can totally support several quotes,
              # and accurately match the closing and opening ones. Imagine doing it
              # with a regexp: either duplicate it, or use some back-referencing magic
              while curr < end and s[curr] != s[start]:
                  # skip a backslash together with the character it escapes,
                  # so an escaped backslash right before the closing quote works
                  curr += 2 if s[curr] == '\\' else 1
              curr = min(curr + 1, end)  # we want to include the closing quote
              yield s[start:curr]
          elif s[curr] == ';':
              # a comment runs to the end of the line, like ;.* in the regexp
              start = curr
              while curr < end and s[curr] != '\n':
                  curr += 1
              yield s[start:curr]
          else:
              start = curr
              while curr < end and not (isspace(s[curr]) or isspecial(s[curr])
                                        or isquote(s[curr]) or s[curr] == ';'):
                  curr += 1
              yield s[start:curr]

Yeah, it's more verbose and somewhat repetitive, but on the other hand it's way more readable, and debuggable too: with regexps, it's always a mystery which part exactly failed to match what you wanted, or captured something you didn't want. Here, the loop invariants and preconditions are almost immediately obvious.
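
For comparison, here is a rough sketch of what the "re-use the regexp" route looks like in Python. The names TOKEN_RE and tokenize_re are made up for illustration, and I'm assuming the pattern is applied with re.findall; findall also returns a trailing empty match at the end of the input, which gets filtered out here.

```python
import re

# The regexp quoted above, unchanged. TOKEN_RE is a hypothetical name.
TOKEN_RE = re.compile(
    r"""[\s,]*(~@|[\[\]{}()'`~^@]|"(?:\\.|[^\\"])*"?|;.*|[^\s\[\]{}('"`,;)]*)"""
)

def tokenize_re(s):
    # findall returns the single capture group for every match;
    # drop the empty string produced by the zero-width match at the end.
    return [tok for tok in TOKEN_RE.findall(s) if tok]

if __name__ == "__main__":
    print(tokenize_re('(+ 1 "a b") ; hi'))
```

It is short, but when a token comes out wrong there is no branch to put a breakpoint on: all you can do is stare at the alternation and guess which arm fired.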