Hacker News new | ask | show | jobs
by BrandonM 6569 days ago
For a class this quarter, I was writing a compiler from scratch in C. For the lexer, we had to have a table mapping current state and current character to next state. Moreover, it was necessary to figure out what states would be needed to produce the desired tokens.

Of course, knowing the language beforehand, one could come up with the necessary states and then produce the table by hand, but that is labor-intensive and error-prone, and having the language we were parsing hard-coded into my lexer code did not satisfy me.

My first approach was to figure out the states by hand, and then, rather than creating the transition table by hand, I defined a simpler structure near the top of my code which was an array of

{ current_state, next_state, "characters-resulting-in-this-transition" }

arrays. Then, at runtime, I would have to perform initialization to translate this structure into the actual transition table itself. It was at this point that I really wished for the macro power of Lisp, to have the full power of the language at compile time to define structures that would remain static at runtime.

As an aside, I ended up writing Python (to figure out both the necessary states and to build the transition table) that generated the C code, and then I added items to my Makefile that would run the Python to create the C and then compile it. If I was using Lisp, I could have done this in a much more straightforward manner.

1 comments

Not to denounce the educational aspect of doing it all, but it would seem the most straightfoward way to do this would be to use lex and yacc, which implement DSLs specifically designed to do this. Using a general purpose language seems like overkill (which is why parser generators exist).
The very last assignment was to use lex and yacc to compile the same language that we had spent the whole quarter writing a compiler and simulator for. The point of the class ("Compiler Design and Implementation") was not to learn the quickest way to write a compiler, but to learn about the common methods used in writing compilers (REs (finite automata) for the lexer, and LL, SLR, and LALR parsers for the compiler). Since this is precisely how lex and yacc work, someone coming out of the class would have the knowledge to port lex and yacc to another language or to perhaps even improve upon it.

As for saying that lex and yacc "implement DSLs", that is a huge simplification of what they do. A large part of the work of lex and yacc is generating the tables and the engine code that are used to translate characters to tokens and token sequences to parser rules. Thanks to the work we did in using this same approach ourselves, it was much easier for me to understand what lex and yacc do and why various design decisions were made.

'As for saying that lex and yacc "implement DSLs", that is a huge simplification of what they do.'

The DSLs I'm referring to are the lex input format and the yacc input format, both of which are specific languages for the domains of lexical analysis and parser generation respectively.

Ahh... I thought you meant that lex and yacc are typically used to implement DSLs.