> There’s a wealth of tutorials, courses, books, and the like about how to write compilers. But if somebody believes that writing a transpiler isn’t fundamentally the same thing as writing a compiler, it may not occur to them to look at any of that material.
The basic argument is this: "compiler" isn't a term that needs to be limited from transforming a high-level input to a low-level output. Any program which is structured to map an AST to another AST, no matter the relative levels of abstraction, can borrow from many of the principles of modern compiler theory, including using many small passes rather than monolithic rewrites.
At a company I worked for, we compiled a high-level declarative expression language of our own devising into SQL and other back-end representations, including English. Is that a "compiler" as many see it? No. However, thinking about the problem like a compiler problem gave us lots of insights into how to architect our product; it let us draw on an existing wealth of knowledge to make improvements quickly and reliably.
Googling "what is a compiler" returns this: "a program that converts instructions into a machine-code or lower-level form so that they can be read and executed by a computer." So, that's something.
It's hard to get more canonical than the Dragon Book. The Dragon Book (2nd Ed.) says this in section 1.2: "Up to this point we have treated a compiler as a single box that maps a source program into a semantically equivalent target program."
So, the most canonical source doesn't include an explicit mention of high-level to low-level (though there may be sources I'm missing). But in my experience, that's definitely the connotation. Otherwise, the term transpiler, which is connoted with not outputting low-level code, never would have arisen.
The Dragon book does not make a distinction between high-level vs low-level output because fundamentally there is none. IMHO it is quite a stretch to claim otherwise because the questionable term "transpiler" exists.
I apologize for not being clearer – I'm not claiming that they're fundamentally different. My original comment, to that point, agrees with the original article's author: compilers are just mappings to and from programs.
re: transpiler – all I'm saying is that the term arose because people associated compilers with low-level outputs.
In summary, there don't appear to be canonical definitions of compilers as having low-level outputs, but for some reason many speak of them that way.
There's also the etymology, ie. the pre-computing dictionary definition: you compile eg. a list, ie. make something smaller/shorter from a larger input. You also write a book when it's an original work, but another author or editor might take parts of yours and other books and compile an anthology. You might translate a book from one language to another, but that's not considered a compilation.
Agreed. I had the same thing writing the parser for my $SHELL. The output was never going to be machine code since the bulk of the code would consist of pipelining external processes and spawning subshells. So the output of "compiler" is an AST-like memory structure with an order of process and tokens for parameters. However I still found following tutorials about compiler design immensely helpful since the problems I faced were largely the same even though the output generated was vastly different.
> At a company I worked for, we compiled a high-level declarative expression language of our own devising into SQL and other back-end representations, including English. Is that a "compiler" as many see it? No. However, thinking about the problem like a compiler problem gave us lots of insights into how to architect our product
Wow. Snap. Except we never got it to compile to English (well, we probably could have but the result would have had deeply nested bracketed clauses...)
That's unfortunately not my experience. I'm appalled every time someone tells me "But Scala.js is not a compiler, it's a transpiler, since it compiles to JS!"
I assume other language users and authors suffer the same kind of comments on a regular basis.
Welcome to the webdev world. It keeps inventing new terms for things that already have established terms for decades, probably because it's rare anyone bothers to look back at the decades of stuff not done in JS...
I'd argue that it's a compiler if it treats JS as a lower language - that the output is simply an intermediate artefact for executing Scala on a JS engine, and not meant for human consumption. If, on the other hand, the output is meant to be developed further by hand, if the output is considered an equivalent and not lower representation, then it's a transpiler.
> If, on the other hand, the output is meant to be developed further by hand, if the output is considered an equivalent and not lower representation, then it's a transpiler.
> It's shorter than writing "source-to-source compiler"
Brevity isn't everything. But more importantly, "transpiler" (like "source-to-source compiler") does not say what you are compiling from, and what you are compiling to. In order for the term "transpiler" to be useful, you need to specify those things. Anything you think you imply by using the term is not, in fact, implied.
If you compare the lengths of "JavaScript to Pascal compiler" and "JavaScript to Pascal transpiler", you might be in for a surprise!
I think the current definition of "transpiler", the way it's used in the wild, means "compiles $something to JavaScript". I haven't seen anyone using this word for anything else than compiling to JS.
You got me thinking. Can a human language like English be defined with an AST? If so, are there examples? If not, why not? I suspect the answer might be, "yes, it's called [this thing I've heard of a thousand times but never considered it to be a language compiler]"
I'd strongly suggest diving into compilers if you've never studied the subject. Learning a bit on the subject unlocks a ton of incredibly useful skills. That knowledge helps you implement stuff like autocomplete, linters, syntax highlighting, etc.
The Super Tiny Compiler [0] is a very gentle introduction to the subject. It's great because it helps you quickly develop an initial mental model.
To give an everyday usage example: I've used jscodeshift [1] many times to safely refactor large amounts of code. In one case, I quickly migrated a project's test assertion library to an alternative which the team agreed was superior. This tool is also typically used by the react team in order to provide a smooth migration path whenever they make changes to the public API.
If I may piggy back on this: I've been writing a tiny optimising compiler in Haskell, to show off the problems and cool ideas that modern compiler toolchains like LLVM have: I'm covering SSA, Scalar evolution, and Register Allocation as a first pass. It's WIP, but [here's the link](https://github.com/bollu/tiny-optimising-compiler)
If people find getting started on a compiler to be a bit too intimidating, one good way to get your feet wet is implementing an interpreter for small subset of a language. Perhaps the basic arithmetic part of adding/multiplying/dividing integers.
I have been reading the T3X book and code and I have to say that I am a huge fan of your work. I haven't finished reading it yet, but I have thoroughly enjoyed it so far! I can do successful 'make test' on Linux and NetBSD, but a trivial T3X program seems to misbehave on NetBSD. I look forward to getting to the bottom of it.
A quick question, if you don't mind: why do t.c and t.t differ slightly in the code emitted for t.memcmp, t.copy and t.fill? Thanks!!
The two compilers differ because I stopped applying non-essential modifications to t.c as soon as it was good enough to bootstrap t.t. There are also some edges cases that t.c does not catch. I think it's best to not use it for anything but bootstrapping!
Let me know about that misbehaving NetBSD program when you find out what caused it -- or even if you don't!
I own a number of the Nils Holm books on language implementation and they are all wonderful in a compact, approachable and real-world way that text books are not. A great companion to any programming language course.
I disagree. I tried that approach for many years but without external input, I could never figure out how to transition from a simple expression language to a proven, working compiler architecture. While large compiler architectures work well for smaller languages, the opposite is not true.
I have found it much better to pick a good introductory text and just work through the exercises.
You can turn an interpreter into a compiler by replacing all code that actually does something by code that prints out the code that does it in the target language. It takes a bit to wrap your head around it, and you won't get an optimizing compiler, but a compiler it will be.
So your values are no longer values in the interpreter's language, but descriptions in the target language for getting that value. To compile an expression, you first handle the sub-expressions, as in an interpreter, which prints the code to compute them. Additionally, you get such a value description for the return value of each expression.
Then you can use those descriptions to print out code to get them into known locations (e.g. registers). Then you can print out code to perform your operation on the values in those locations and put the result in another location (e.g. on the stack). The return value of this compilation step is the description of that location.
I like your comment so please don't interpret my remark as evidence of the contrary. I simply think you misunderstood me.
I didn't mean to say that it's hard to transition from interpreters to compilers, only that I found it hard to transition from interpreting/compiling an arithmetic language to a full structured programming language with general recursion, loops, data structures, variables and so on.
Since I'm not sure how understandable the explanation was, here is an example, a simple interpreter for a Lisp-like language, using Python lists as the syntax tree.
def interpret(state, function_table, expression):
function = function_table[expression[0]]
return function(state, function_table, *expression[1:])
def interpret_all(state, function_table, expressions):
return [interpret(state, function_table, e) for e in expressions]
interpreter_table = {
'+': lambda s, ft, *es: reduce(lambda a, b: a + b, interpret_all(s, ft, es)), # sum
'*': lambda s, ft, *es: reduce(lambda a, b: a * b, interpret_all(s, ft, es)), # product
'!': lambda s, ft, k, v: s.update({k: interpret(s, ft, v)}), # set variable
'?': lambda s, ft, k: s[k], # get variable
';': lambda s, ft, *es: interpret_all(s, ft, es)[-1], # execute a sequence and get the last value
}
An example program for 'x = a * (b + a), return x * x' looks like this:
The compiler can be implemented by swapping out the interpreter_table with different functions to print code instead of executing it:
def fresh_variable(state):
next_tmp = state.get('__next_tmp__', 0)
var = 'tmp_%s' % next_tmp
state['__next_tmp__'] = next_tmp+1
return var
def compile_sum(state, a, b):
var = fresh_variable(state)
print(var+' = '+a+' + '+b)
return var
def compile_product(state, a, b):
var = fresh_variable(state)
print(var+' = '+a+' * '+b)
return var
def compile_set(state, key, value):
state[key] = key
print(key+' = '+value)
compiler_table = {
'+': lambda s, ft, *es: reduce(lambda a, b: compile_sum (s, a, b), interpret_all(s, ft, es)), # sum
'*': lambda s, ft, *es: reduce(lambda a, b: compile_product(s, a, b), interpret_all(s, ft, es)), # product
'!': lambda s, ft, k, v: compile_set(s, k, interpret(s, ft, v)), # set variable
'?': lambda s, ft, k: s[k], # get variable
';': lambda s, ft, *es: interpret_all(s, ft, es)[-1], # execute a sequence and get the last value
}
Then you can easily compile the program:
>>> interpret({'a': '10', 'b': '20'}, compiler_table, program)
tmp_0 = 20 + 10
tmp_1 = 10 * tmp_0
x = tmp_1
tmp_2 = x * x
'tmp_2' # this is the return value, telling you where to find the result
Depending on the syntax of your target language, control flow might require additional compiler state. In Python, I would have to store the current indentation level. If someone is interested, I could show how to do if-statements.
If you are interested in compilers at all, but not for Lisp-like languages, I would recommend studying this self-compiling C-subset compiler and virtual machine very, very carefully: https://news.ycombinator.com/item?id=8558822
As a favor for a friend, I wrote a mini-"compiler" that translated detailed specs for story game scenes into code. I had never done anything similar (my background is more on the Math/Stats side) but I figured hey, what the hell.
There were only a few types of possible scenes, so my first approach was to create a data structure for each type of scene that had a method converting it to code. However, this broke badly with conditional/branching paths that could potentially have arbitrary levels of nesting.
So my next approach was to create multiple passes using some simple recursive data structures "Parseables" where conditional branches could contain other Parseables (including other conditional branches) as well as multiple passes (Text -> Parseable -> Printable -> Code instead of Text -> Screen -> Code.) This worked quite nicely.
Had I realized this was a compiler, I could have probably read some tutorials and not had to do everything from scratch. This would probably have resulted in better engineering, but been a lot less fun.
Don't feel bad, tutorials are for when you don't know how to start. If you have a clue how to approach the next step, having a go based on intuition is a superior strategy. Maybe you'll do something stupid, maybe not; either way you learn which means we all win in the long run.
I wonder if scheme-based compiler courses are still run at Indiana University?
Abdulaziz Ghuloum's 'Incremental compiler construction' [1] also has a working compiler at the end of each stage. For example after the first week you have a compiler that outputs a program that prints a single integer, the 2nd week immediates. The tutorial is at [2]. It doesn't use a nanopass framework, just builds the complexity of the language. It's for a scheme compiler written in scheme.
Apparently Ghuloum was a phd student under Dyvbig who also wrote his own scheme compiler to x86 [3],[4].
I think the nanopass concept - e.g. lots of well defined IRs passing through a pipeline - is a very good idea for teaching, but I'm interested to see how well it performs and also whether, from the programmers perspective, whether this simplifies code or adds unneeded complexity (Specifically, whether they can be optimized quite as well as a big monolithic compiler).
I'm currently writing a compiler framework - nothing big, as a learning project, and I think I shall try integrate this idea in some way. I feel it would be particularly useful for compiler backends like GCC or LLVM because having well defined pipelines etc. means that one can hook into the framework cleanly (Potentially a huge saving in compile times, e.g. not having to cart around unneeded libraries/symbols).
Chez Scheme [0] is written using the nanopass framework, and it's regarded as one of the fastest Scheme compilers in existence [1]. Before it was rewritten to use the nanopass system, Chez's compiler was known for its performance in terms of lines of code compiled per second; the rewrite slowed it down a bit, but the quality and performance of generated machine code improved. Andy Keep and Kent Dybvig wrote a paper about the project [2]. I haven't browsed the Chez source, but it's a good way to answer your question.
Petrashko et al., "Miniphases: Compilation using Modular and Efficient Tree Transformations", PLDI 2017
https://infoscience.epfl.ch/record/228518/files/paper.pdf
talks about how they gained a lot of performance by fusing separate passes. So it looks like the nanopass abstraction (like most abstractions) does have a cost in performance. But ease of compiler implementation should in almost all cases be considered more important, I think.
I suppose if you look at it that way, then anyone who has completed Crenshaw's excellent tutorial series[1] could also claim to have written (approximately) the same number of "compilers".
With a parser combinator library, you write a parser by starting with a bunch of primitive parsers (say, that parse numbers or characters) and combining them, eventually building up the ability to parse a sophisticated language.
That sounds like recursive descent, also a highly recommended method of writing a parser for its simplicity, speed, and ease of error reporting.
But if somebody believes that writing a transpiler isn’t fundamentally the same thing as writing a compiler, it may not occur to them to look at any of that material.
From what I understand, the definition of a transpiler is one which almost exclusively performs syntax-syntax transforms, and doesn't delve into the semantics with e.g. dataflow or control flow. Thus the lack of material about writing "transpilers" --- or rather, someone looking to write one should instead be seeking out information on "search and replace" algorithms.
It is recursive descent. Scala's parser combinator library is beautiful, and has been, hour-for-hour, the most useful tool I have learned as a programmer. Parsing problems became so common once I understood how to parse things.
I'm writing my first compiled DSL right now, inspired by the sense that parser combinators gave me: "maybe you don't have to be a genius to write a compiler."
In addition to parser combinators, another great functional tool for dealing with recursive structures (e.g. abstract syntax trees) is recursion schemes. I've been banging my head against them this week, and I finally made some headway. They are useful for the nanopass technique referenced in the article.
> From what I understand, the definition of a transpiler is one which almost exclusively performs syntax-syntax transforms, and doesn't delve into the semantics with e.g. dataflow or control flow.
> From what I understand, the definition of a transpiler is one which almost exclusively performs syntax-syntax transforms, and doesn't delve into the semantics with e.g. dataflow or control flow.
A counter example to this is the typescript compiler which some might describe as a transpiler. It does sophisticated control flow analysis to, for example, make sure all branches of an if statement return the same type.
There's a decent textbook written in the incremental style argued for by the author: Essentials of Compilation. It features compilers for seven languages, written in Racket, each with successively more advanced features.
I am very fascinated by this notion of teaching compilers by starting at code generation and working backwards from there. I've read about nanopass compilers before, but do they also work well if the compiler is written in a statically typed language? They clearly will for passes that do not change the program representation, but my experience is that it is awkward to define large amount of similar-but-slightly-different representations in a statically typed language.
I think you end up needing macros even in statically typed languages to make it convenient. Even the nanopass framework uses macros to define the different representations.
I've always thought of compilers as usually lossy graph rewriters with input and output data structures usually being 'flatter' in some sense. Maybe a both simplistic and vague model, but it has served me well enough the few times I needed to build one.
This isn't meant to validate or invalidate any other view or definition, but I'm curious if there are any good counter examples or theoretical reasons for characterising compilers differently ?
Quite naturally, all compilers might not be implemented with explicit graphs and their subsequent rewrites, but implicitly both input and output represent a set of statements encoding a specific set of truths and actions, all of them which is intrinsically related to their various contexts.
In this light there is really no distinction between high level and low level targets, but only between various levels of information loss and how explicit the actions and truths are expressed in the input and output.
It might be worth noting that most of the perceived information loss, except for names, is only due to human perception and limitations. State of the art decompilers can in many cases recover a surprising amount of the original types and code structures, albeit at considerable computational costs.
Graph rewriting turns out to be a rather difficult problem, and many optimizations are better recast in other frameworks. For example, instruction scheduling (on superscalar processors) is pretty much a textbook example of the job scheduling problem. Loop optimizations can be most easily expressed in the polyhedral loop optimization model. Decisions like inlining can be framed as high-dimension, highly non-linear, multi-goal optimization (in the mathematics sense) problems.
I can't think of any reason to contradict that as a abstract description of a compiler, although the standard - Practical? - definition of a compiler is probably the more useful of the two.
Graph rewriting is probably a little bit niche, e.g. AFAIK most compilers don't describe what they do as graph rewriting. Perhaps because many compilers do a lot of their work on linear data structures (which are part of a larger graph) like Basic Blocks.
Most compilers sort of work this way, except that they don't have an intermediate format. They take source, turn it into tokens, turn it into a tree of sorts, run several passes over it till the end result is reached (whatever the target is).
My own compilers turn 4 different input languages in the same parser tree (Hand written tokenizers/parsers), do several passes over it until it has a very base set of instructions left, at which point it generates code for whatever backend was picked.
The only big difference is that the process outlined in the article has an actual textual intermediate format, if I understand it correctly. Sounds like a lot of work of extra work with little gain, a simpler approach might be to have "ToString" method working on all nodes on all levels that' outputs info clear enough to understand from within a debugger.
It's not that there's a textual intermediate format: just a defined intermediate tree structure. Some examples are given in the nanopass framework documentation [0]. So it's not that the input language is processed into an intermediate format, which is printed to a string and then read by the next pass; the intermediate languages are all in various forms of trees. See also the paper on writing Chez Scheme as a nanopass system [1].
Anyone got stories about attempts to combine compiler construction with deep learning techniques? As AI related technologies now become realistically implementable, wouldn't the compiler theory be one of the most greatly affected research fields?
I find many people tend to use the terms parser and compiler almost interchangeably, which is a horrid mistake. These terms are not the same and are not related. So, lets get clear on terminology:
* lexer: A scanner. It runs code, typically as a string, through an evaluator and builds pieces based upon known language syntax rules.
* parser: A rule evaluator. Parsers typically use lexers to reason about code and then evaluate the pieces to determine context, relationships, and categories that describes the code structure.
* compiler: A transformer. Compilers change code from one format (syntax) to another different format.
---
> With a parser combinator library, you write a parser by starting with a bunch of primitive parsers (say, that parse numbers or characters) and combining them, eventually building up the ability to parse a sophisticated language.
I do like that part of the article. I am working on a universal language parser right now. To be truly universal you have to accomplish two big goals:
1. parse all the languages
2. seamless interchange between the various different parsers (it is a single parser with interchange between the various lexers)
Seamless interchange is necessary for languages like JSX, which starts as JavaScript, but can contain XML code units that then escape back to JavaScript syntax. Another example is code blocks in markdown documents where the code block can specific a name of the language described by the code block.
Lexers split a program (string) into groups of characters as a list of tokens. Lexical analysis has nothing to do with syntax. Canonically, syntax and grammar are the same thing and parsers deal with that. A parser looks for patterns of tokens that match up to a grammar rule. If all the grammar rules that compose a correct program are parsed, a parser produces an AST that the compiler can act on. Otherwise the parser errors out due to a programmer syntax error.
The boundaries between lexers and parsers differ by the parser in question.
Grammar and syntax are also different, particularly with regards to XML based languages. Syntax are the rules which define the language while grammars are the conventions that define the context in which artifacts in a language instance are interpreted. Whether a language requires terminating semicolons or curly braces are syntax rules. Whether there is an object schema or namespace concern is a grammar issue.
Parsers commonly produce abstract syntax trees (AST), but can produce output in a variety of formats. I prefer parse tables personally. To say that a parser must produce as AST is rather short-sided and inexperienced.
While parsers typically rely upon lexers and compilers typically rely upon parsers there is no law proclaiming computation must occur in that flow. Lexers, parsers, and compilers are all separate steps that can act independently provided a sufficient configuration.
> The boundaries between lexers and parsers differ by the parser in question.
Indeed, the division between lexer and parser is somewhat optional, since scannerless parsers can be defined to operate directly on a stream of basic symbols from the language's alphabet, but the usual division reflects notions from automata theory:
- A lexer (a.k.a., lexical analyzer, tokenizer, scanner) is founded, in principle if not in actuality, on a finite state automaton to group input symbols from an alphabet into basic meaningful units ("lexemes" or "tokens") and to classify those units according to their significance (identifiers, keywords, literals, delimiters, etc.). In a natural language context, it can be thought of as producing words from a sequence of letters.
- A parser is founded, in principle if not in actuality, on (usually and at least) a finite state pushdown automaton, which is basically a finite state automaton equipped with a stack which can be manipulated by the transition function. The goal of a parser is to convert a stream of input symbols into one (or sometimes more) derivations (a.k.a. parse trees, [concrete] syntax trees, parses) which represent the structure of the input in terms of a grammar. In a natural language context, it can be thought of as producing sentence diagrams from a sequence of words (or letters in the case of scannerless parsing).
> Grammar and syntax are also different,
No, grammar and syntax are the same. A grammar consists of an array of productions, which are rules that define how symbols in the language may be combined to form valid sentences in the language. Usually, "grammar" refers to a formalism that is sorta like a big regular expression, except that it's not limited to the regular languages (context-free grammars being most common, with extra-parser hacks to support context-sensitive languages if needed), but a grammar is really an abstract mathematical object and need not be formally manifest.
> Syntax are the rules which define the language while grammars are the conventions that define the context in which artifacts in a language instance are interpreted.
Indeed, syntax can be thought of as the rules which define a language (that is, can be thought of as a grammar). However, what you describe as "the conventions that define the context in which artifacts in a language instance are interpreted" is properly called semantics, which are rules for ascribing meaning to phrases in the language.
> Parsers commonly produce abstract syntax trees (AST), but can produce output in a variety of formats. I prefer parse tables personally. To say that a parser must produce as AST is rather short-sided and inexperienced.
This is true; indeed a parser can directly produce any sort of output: a single truth value that indicates whether the input is part of the language that it recognizes, a translation into another language, or even "no" output in the case of an interpreter. However, I'm confused by your usage of "parse tables" here. The term usually refers to the tables that are used to drive a parsing algorithm (i.e., an input to the parsing algorithm) rather than the output of a parser. Would you elaborate for me?
> Lexers, parsers, and compilers are all separate steps that can act independently provided a sufficient configuration.
Indeed, a parser does not necessarily require a lexer, nor must a parser's output be compiled. However, though your statement is technically true, I can't even imagine what a compiler without a parser might look like!
> There’s a wealth of tutorials, courses, books, and the like about how to write compilers. But if somebody believes that writing a transpiler isn’t fundamentally the same thing as writing a compiler, it may not occur to them to look at any of that material.
The basic argument is this: "compiler" isn't a term that needs to be limited from transforming a high-level input to a low-level output. Any program which is structured to map an AST to another AST, no matter the relative levels of abstraction, can borrow from many of the principles of modern compiler theory, including using many small passes rather than monolithic rewrites.
At a company I worked for, we compiled a high-level declarative expression language of our own devising into SQL and other back-end representations, including English. Is that a "compiler" as many see it? No. However, thinking about the problem like a compiler problem gave us lots of insights into how to architect our product; it let us draw on an existing wealth of knowledge to make improvements quickly and reliably.