Hacker News new | ask | show | jobs
by hasbot 679 days ago
So what's different about writing a compiler in 2024 than say 10, 20, or 30 years ago? When I started writing compilers in the 80's and 90's lex/flex and yacc/bison were popular. ANTLR came out but I never had a chance to use it. Everything after lexing and parsing was always hand rolled.
6 comments

There are actually quite a few changes!

The most obvious change you'll see is the use of SSA, which has become the dominant representation in IR starting 25-30 years ago.

There's also been an increase in the importance of compiler IRs, and especially the concept of code passing through multiple IRs before reaching machine code.

Formal semantics has become more of a thing in the past decade or so. It's now routine that even weak memory models have a detailed formal model of how they work. In LLVM, it's now a requirement that you demonstrate formal correctness of new InstCombine transformations (which are essentially peephole optimizations).

The use of parser generators has really fallen into disrepute; everything has transitioned to handrolled parsers these days. Language standards themselves are starting to rely on context-sensitive keywords, which are hard to implement in a generator-based lexer/parser setup.

Optimizations have generally broadened in scope; we're now seeing whole-function level of optimization being the default way to look at stuff for analysis, and there's been a variety of techniques introduced to make whole-program optimization (aka LTO) much more tractable.

Another subtle but major shift is that compilers are increasingly reliant on inferred analysis from dumber representations over the original declared intent of the code. For example, SROA in LLVM (which breaks up structs so that individual fields can be independently allocated in registers) relies not on looking at the way the struct is declared but the offsets within the struct that various load and store operations use.

A final major shift is the trend towards support for ancillary tooling in the programming language space, so that the compiler isn't merely a tool that goes from source to object code, but is something that can be actively queried for information about the source code. Things like the language server, or smart formatting, or automatic refactoring tooling.

This is a fantastic comment, thanks.
You can use the "classic" tool set and parse many programming languages. The real trick in writing compilers is taking advantage of the hardware (disclaimer: I designed and wrote middle-pass portions for IBM 360/370 processors (and clones), supercomputers from Cray and ETA Systems, and Sun 4 workstations among others, including some RISC systems that just disappeared around the bursting of the dot-com bubble) I tried and failed to optimize MPP systems from all of the "name" players in the 1990's. It kind-of broke my heart...

That "middle-pass" approach that will let you address many targets is still valid; the trick is finding a sufficiently robust and flexible internal representation at the right level. You also have to be able to out-guess the chip vendors where before you could go to the architect or a complete "System" book and get the real scoop, including things you shouldn't do. Oddly enough, there is simultaneously useful and completely worthless documentation scattered about the internet.

You might want to take a look at Muchnick and Jones' _Program_Flow_Analysis_ (yes, it's from 1981) but chapters 4-6 can be applied at code-generation time. How that fits modern Intel processors (for example) is unknown. Idealizing your processor as a RISC-V might be a reasonable way to proceed but in the end, you'll have to translate the code for the target -- it will be reasonably straight-forward if you drive it all from tables but it's not trivial.

The book doesn't have 2024 in the title. I suspect they put it there because last time a post about this book was made, I noted that it was from 2022, not realizing that the book has now been released in 2024.

https://news.ycombinator.com/item?id=40940799

> So what's different about writing a compiler in 2024 than say 10, 20, or 30 years ago?

As far as I can tell, the main difference is that static single assignment (SSA) as an intermediate form was not the norm 30 years ago, but it is nowadays. Also, in newer books, it's more common to go over global register allocation now, whether that's graph coloring or linear scan register allocation. If you read old compiler books, the main optimizations they talk about are use-def chains, moving computations out of loops, and using the local and tree-based Sethi-Ullman register allocation algorithm.

Compiling today might be done at run time, exploiting dynamic information. The line between a compiler and an interpreter is blurred.
ANTLR generate a visitor system from a BNF grammar that you need to implement the logic of in your chosen language, which I believe is similar to C++'s std::visit. lex and yacc generate state machines using BNF grammar and you implement the logic in the tools themselves
Parser-generators were always academic projects that had little relevance to making real-world programming languages -- where parsing is very easy to write, and necessarily benefits from doing it (ie., you can get better error handling/etc.).

Today most languages are front-ends for LLVM IR, but LLVM is very slow and takes a long time to optimize. Many new languages target x86/arm directly with their own weakly optimized backends, and output an LLVM IR for "release builds".

yacc/bison was used for lex, bc, pcc, gcc, original awk, the bsd pascal compiler, eqn, m4 (!), and many other languages. it's still used for pcc, oawk, mawk, pari/gp, and units. that's just what i have sitting around in my downloads directory

and, while we're talking about ocaml, ocaml does use ocamllex and ocamlyacc for its own parser

so, while you can certainly do without parser generators, they have very commonly been used for making real-world programming languages. almost every programming language anyone here has ever heard of was first implemented with a parser generator. the main exceptions are probably fortran, cobol, algol, lisps, c, and pascal

I guess I wasn't very clear. I didnt mean to say, as a historical matter, were irrelevant.

I meant to say the idea of a parser generator is a solution to a problem that that real world langs don't really have. When writing a programming language, your issue isnt how much time the parser is going to take to write, or how complex it's going to be. The parser is a relatively trivial part of the problem.

Due to language designers often being taught to develop langauges in this fashion, many have relied on these tools. But the modern view of compliers as "programming langauge UIs" and the focus on DX, i'd argue its actively pathological to use a parser generator.

Much academic work has, til recently, focused on these areas -- whereas today, the bulk of the difficulty is in understanding SSA/LLVM/ARM/Basic Optimizatiosn/etc. details which are "boring, circumstantial" etc. and not really research projects. I was just pointing this out since a lot of people, myself included, go down the EBNF parser-generator rabbit hole and think inventing a langauge is some formal exercise -- when the reality is the opposite: it's standard programming-engineering work.

oh, i agree that parsing is not the hard part of writing a compiler, and that compilers classes overemphasize it

but no language starts out as a 'real world lang'; every language is initially a toy language, and only becomes a 'real world lang' in the unlikely case that it turns out to be useful. and parser generators are very useful for creating toy languages. that's why virtually every real world lang you've ever used was implemented first using a parser generator, even if the parser you're using for it now is handwritten

having a formally defined grammar is also very helpful for dx features like syntax highlighting and automated refactoring

The one thing parser generators do is that they ensure that the language you implement actually matches the grammar you wrote. That’s still an important assurance to have.
they also tell you where the grammar is ambiguous, which can be useful
I've created many programming languages. All the ones I finished and were useful did not have grammars that "I wrote".
do you mean https://github.com/mjburgess/Lyssa and https://github.com/mjburgess/Quazar? it's true that i can't find a grammar in either of them (https://github.com/mjburgess/Lyssa/blob/master/src/impl.py#L... seems more forthy than anything else) but (while they are very much the sort of things that i like, thank you for sharing) they also seem somewhat less like 'real-world programming languages' than things like awk, ocaml, or our other example in this thread, gcc c and objective-c until gcc replaced the bison parser with a handwritten one in 02006. that last compiler was the compiler nextstep was built on, which got steve jobs back in control of apple, to replace macos with nextstep. seems pretty real-world to me

maybe you're talking about stuff you haven't released?

Are you saying your programming languages don’t have a defined grammar?
This still isn't quite correct. Back in the day parsing was a much larger portion of the complexity of a compiler: performance was much more of a concern, as was memory usage. Parser generators were an attempt at helping with that, by allowing programmers to produce more efficient (e.g. table-driven) parsers than what they could have otherwise written by hand. They only really went out of fashion because A) computers got faster and bigger faster than programs got longer, so parsing became less and less of a percentage of total utilization, and B) you can get much better error messages out of recursive-descent parsers.
Fortran compilers did go through a phase when table-driven parsers were used, but it had the disadvantage of needing complicated lexers and statement classifiers that rely on semantic information. Fortran’s a hard language to parse, given its lack of reserved words, optional spaces, and many ambiguities.

The f18 compiler’s parser uses parser combinations to construct a backtracking recursive descent parser that builds a parse tree for the whole source file before doing any semantic analysis. This approach allows good error recovery without having to revert any updates to the symbol table.

that's an interesting approach! though probably not applicable to c and c++

i assume that by 'parser combinations' you mean parser combinators

what i meant about fortran is that the first fortran compiler didn't use a parser generator

When your computer was anemic, and could barely do the tasks required for it, eking out a few percent — or a 2x! — from an optimizer was important.

Now-a-days, the difference between "big compiler optimized" and "little compiler not optimized" can be quite dramatic; but, is probably no more than 4x — certainly within range of the distinction between "systems programming language" and "high tuned JITted scripting language". I think most people are perfectly fine with the performance of highly-tuned scripting languages. The result is that all of the overhead of "big compiler" is just ... immaterial; overhead. This is especially true for the case of extremely well-tuned code, where the algorithm and — last resort — assembly, will easily beat out the best optimizer by at least an order-of-magnitude, or more.

Wouldn’t just a simple case of bad code generation render little compiler into a toy one?

Repeating others, today’s compilers are really just “optimizing compilers”, there is no room for toying in production environments.

>just a simple case of bad code generation render little compiler into a toy one

If you find some time to go through gcc bugzilla you'll find shockingly simple snippets of code that miscompiled (often by optimization passes), with fixes never backported to older versions that production environments like RHEL are using.

I realized with all the rhel systems I’m using, we are never using default toolchains on them. Just use those old systems to run stuff, even newer toolchains.
Ok, but that’s sw engineering issue.

I still insist that a production grade compiler can’t leave performance on table. Which is where the current battlefield is.

I think a production grade compiler not only can, but must, leave performance on the table when the cost is correctness (unless the performance gain is incredibly high and the correctness loss is minimal). Correctness is not all important, but it is the most important thing. Unfortunately, compiler writers do not agree and they do silly things like "let's assume UB cannot ever happen and optimize based on that".
I do not agree in the general case. There are very useful DSL compilers which do not consider performance at all, but just compile to a target which does the optimization for them (JVM, LLVM IR or even just C)
if you aren't running on the gpu you're leaving 80+% of your computer's performance on the table. no optimizing compiler is going to make your legacy c or lisp or rust code run efficiently on the gpu, or even in most cases on a multicore cpu. nor, as thechao points out, can it compete with assembly-language programmers for simd vectorization on the cpu

in summary, optimizing compilers for c or pascal or zig or rust or whatever can only be used for code where considerations like compatibility, ease of programming, security, and predictability are more important than performance

probably the vast majority of production code is already in python and javascript, which don't even have reasonable compilers at all

How come you made this about computer’s performance :)
because computer performance is the topic of https://news.ycombinator.com/item?id=41255695, https://news.ycombinator.com/item?id=41255365, and https://news.ycombinator.com/item?id=41255195, the three levels of parent comments in the thread, one of which was by you
> Parser-generators were always academic projects

Were they? GCC abandoned bison in favour of their own parser relatively recently.

"relatively recently" as in around 20 years ago, in GCC 3.x according to sources I found.
gcc was first released in 01987, but it didn't replace its bison parser for c until gcc 4.1 https://gcc.gnu.org/gcc-4.1/changes.html which was in 02006 https://gcc.gnu.org/releases.html, only 18 years ago, and 19 years after it was released. joseph myers first proposed doing this in 02004 https://gcc.gnu.org/legacy-ml/gcc-patches/2004-10/msg01969.h...

so gcc has literally been using a parser-generator-generated parser for c for more than half its existence, at which point it had already become the most popular c compiler across the unix world and basically the only one used for linux, which had already mostly annihilated the proprietary unix market. it was also imposingly dominant in the embedded space

and i think that kind of development path is fairly typical; getting a parser up and running is easier with a parser generator, but it can be tricky to get it to do strange things when that's what you want (especially with lalr, less so with peg)

Could you give some specific examples of those new languages with their own backends for faster builds?
I was immediately thinking of Jai, Zig, .. but I've seen a few
Correct me if I'm wrong, but I think both Zig and Jai use LLVM as their default backend...at least, that's what I have seen via live streaming for Jai, and from Zig's repo.
Zig is moving away from LLVM (https://github.com/ziglang/zig/issues/16270) and Rust has added Cranelift as a debug backend (https://lwn.net/Articles/964735/).

Not sure about Jai.

I don't think it's moving away, because if you have read the entire thread, the gaming community reacts negatively to say the least and they assure everyone that LLVM does not going anywhere any time soon.