Hacker News new | ask | show | jobs
by lindig 4054 days ago
A lexer for a language with a lot of keywords leads to a large representation as an automaton as the author has experienced. One way to deals with this is to only recognise in one rule all identifiers including keywords (something like "[a-z_][a-zA-Z0-9_]*" and to use a hash table of keywords to check whether a match is a keyword (and which one) or an identifier .

Edit: fixed the regexp to allow for single-char identifiers.

1 comments

> One way to deals with this is to only recognise in one rule all identifiers including keywords (something like "[a-z_][a-zA-Z0-9_]+" and to use a hash table of keywords to check whether a match is a keyword or an identifier (and which one).

A lot of production parsers I've seen do this with two state machines. They'll have a top-level state machine (usually hand-implemented using a giant switch statement) that handles all of the main different token types—numbers, strings, punctuators, etc. That has a single type for all identifiers, including keywords.

Then, after it determines a token is a identifier, it runs another little tiny optimized state machine to see if it's a reserved word or not.