by bluetech
4049 days ago
[Given all the bad things I hear about PHP, the code is very readable even without any previous experience. Nice.] Here are some things a lexer for a programming language might have to deal with:
1. Comments (some languages even allow nested ones, which rules out regular expressions for that).
2. Continuation lines.
3. Includes (if done at the lexical level).
4. Filename/line/column tracking for nice error messages (this can really hurt through branch mispredictions).
5. Evaluation of literals: decimal/hex/octal/binary integers, floats, strings (with escapes), etc.
6. Identifiers.
So matching keywords is mostly the straightforward part. However, I have found that matching many keywords is the perfect (and in my experience so far, the only) use case for a perfect hashing tool like gperf: it is normally much faster than any pointer-chasing trie. gperf has mostly eliminated keyword matching from the profile of every lexer I've written.

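To illustrate the nested-comment point: a regular expression cannot count nesting depth, but a small hand-written loop can. This is only a sketch, assuming C-style `/*` and `*/` delimiters; the function name `skip_nested_comment` is made up for the example and is not taken from any particular lexer.

```python
def skip_nested_comment(src: str, pos: int) -> int:
    """Return the index just past the (possibly nested) comment starting at src[pos].

    Assumes "/*" opens and "*/" closes a comment; both are illustrative choices.
    """
    if not src.startswith("/*", pos):
        raise ValueError("no comment opener at this position")
    depth = 0
    i = pos
    while i < len(src):
        if src.startswith("/*", i):
            depth += 1          # entering one more nesting level
            i += 2
        elif src.startswith("*/", i):
            depth -= 1          # leaving one nesting level
            i += 2
            if depth == 0:
                return i        # outermost comment is closed
        else:
            i += 1
    raise ValueError("unterminated comment")
```

The depth counter is exactly the piece of state a regular expression lacks. A real lexer would also update its line/column bookkeeping inside this loop.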
Some languages process escapes before everything else. Looking at you, Java, where \uXXXX Unicode escapes are translated before tokenization. So you either need a pass beforehand to unescape them, or you unescape characters on the fly (and do the error handling etc. in the process!).
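A rough sketch of the "pass beforehand" option, following my reading of how Java treats Unicode escapes: a backslash followed by one or more `u` characters and exactly four hex digits, where the backslash only counts if it is preceded by an even number of backslashes. The function name `unescape_unicode` is hypothetical, and a real lexer would also need to map offsets back to the original source for error messages.

```python
import string

HEX_DIGITS = set(string.hexdigits)

def unescape_unicode(src: str) -> str:
    """Translate Java-style \\uXXXX escapes in src before tokenization."""
    out = []
    i = 0
    n = len(src)
    while i < n:
        ch = src[i]
        if ch != "\\":
            out.append(ch)
            i += 1
        elif i + 1 < n and src[i + 1] == "u":
            j = i + 1
            while j < n and src[j] == "u":   # one or more 'u' are allowed
                j += 1
            digits = src[j:j + 4]
            if len(digits) != 4 or any(c not in HEX_DIGITS for c in digits):
                raise ValueError(f"malformed unicode escape at offset {i}")
            out.append(chr(int(digits, 16)))
            i = j + 4
        else:
            # Copy the backslash together with the next character, so that
            # an escaped backslash cannot start a unicode escape itself.
            out.append(ch)
            if i + 1 < n:
                out.append(src[i + 1])
            i += 2
    return "".join(out)
```

For example, `unescape_unicode("int \\u0061 = 1;")` yields `int a = 1;`, while a doubled backslash in front of `u0041` is left untouched. Doing this on the fly inside the lexer avoids the extra copy, but then every character read has to go through this logic.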