Hacker News new | ask | show | jobs
by brundolf 1021 days ago
> Instead, we'll be single-pass: code generation happens during parsing

IIRC, C was specifically designed to allow single-pass compilation, right? I.e. in many languages you don't know what needs to be output without parsing the full AST, but in C, syntax directly implies semantics. I think I remember hearing this was because early computers couldn't necessarily fit the AST for an entire code file in memory at once

7 comments

Linked from another thread: http://cm.bell-labs.co/who/dmr/chist.html

It explains the memory limits and what happened :)

> After the TMG version of B was working, Thompson rewrote B in itself (a bootstrapping step). During development, he continually struggled against memory limitations: each language addition inflated the compiler so it could barely fit, but each rewrite taking advantage of the feature reduced its size. For example, B introduced generalized assignment operators, using x=+y to add y to x. The notation came from Algol 68 [Wijngaarden 75] via McIlroy, who had incorporated it into his version of TMG. (In B and early C, the operator was spelled =+ instead of += ; this mistake, repaired in 1976, was induced by a seductively easy way of handling the first form in B's lexical analyzer.)

Infix parsing chews up a remarkable amount code and memory.

It's scary just how much easier it is to parse languages without infix parsing.

where it takes up the memory in the human brain, where we have more limited working set, than in computers. That is probably why postfix notation is mostly a thing of the past by now in languages intended to be used by humans.
(in (is "prefix notation" (and "alive" "kicking")) "all lisps")

    (-<> "alive"
         (and "kicking")
         (is "prefix notation" <>)
         (in "all lisps"))
But creates a habit of having exactly two individual items for an operation.

Which led us to imperative programming and object oriented programming. Which everybody now recognizes as limited in various artificial ways.

Which we are now struggling to shake off given that the majority of chips have a whopping number of parallel cores.

Programming languages restrain your thinking about your solution space.

I wonder why =+ is so obviously a mistake. It does look vaguely wrong for some reason, but I’m prejudiced by current languages.
>I wonder why =+ is so obviously a mistake

Consider

  x = -5
now how about:

  x =- 5
One makes x negative 5, the other subtracts 5 from it. And under this design of the operator the only difference is a space.

It's the same with +, just that we're not used to seeing unary plus in the wild.

>> x =- 5

I think you mean: x -= 5, which indeed is different.

I'm showing how it would be under the original operator design which was reverted, which was "=-".
I think because it's ambiguous with unary plus (a = +b), since C isn't supposed to have significant whitespace in most circumstances.
You also run into problems with a=*p and a=-b, which are perhaps more likely.
But they could've fixed that by going a=(*p) and a=(-b);

Kind of how we use (*p)->next instead of *p->next where p is node_t**

That seems a little backwards and barbaric, though right?

Imagine if we had to watch out for this as a common pitfall:

    // BUG! Actually subtracts x from current val of neg_x.
    neg_x = -x;
Even moreso, how would these two lines behave? Would they differ in semantics?

    n = -5
    n =- 5
Overall, -= is just so much less ambiguous.

EDIT: To your point about ->, I personally think C would be better if:

    *p->next
parsed as:

    (*p)->next
instead of:

    *(p->next)
but maybe now I'm not thinking through all the parsing impliciations :)
But originally there wasn't a += or =+ lexical token! Those things were scanned as separate tokens, and could be written = +.

So that is problematic; If {a}{=}{-}{b} means a = a - b, then you have no way to write a = -b without parentheses like a = (-b).

You're exactly right. This makes for a small, memory-efficient compiler. But this entails a lot of compromises that we're not willing to put up with anymore, because there's no longer a reason to.
I'm not sure, haven't looked at the codebases of old compilers in a long time. Definitely a lot of the language is pretty amenable to it, especially if you have unstructured jumps for e.g. the for advancement statement. I had a distinct feeling while writing the compiler every time I added a new feature that "wow, the semantics work exactly how I'd like them to for ease of implementation."

Compare that to, say, Rust, which would be pretty painful to single-pass compile with all the non-local behavior around traits.

What you are saying is true for a naive C compiler.

Once you want to optimize or analyse, things become more complicated.

> Compare that to, say, Rust, which would be pretty painful to single-pass compile with all the non-local behavior around traits.

Type inference also spans a whole function, so you can't do it in a single pass through the code. (But it's still tamer than in eg Haskell, where type inference considers your whole program.)

In C, nothing you have not parsed yet (what it to the right) is necessary for what you've already parsed (what lies to the left). (Necessary for type checking it or translating it.)

E.g. to call a function later in the file, you need a prior declaration. Or else, an implicit one (possibly wrong) will be assumed from the call itself.

This is not true in some C++ situations, like class declarations. In a class definition there can be functions with bodies. Those are inline functions. The functons can freely refer to each other in either direction. Type checking a class declaration therefore requires all of it to be parsed.

A one pass language is advantageous even if you're building a serious multi-pass compiler with optimization. This is because that exercise doesn't require an AST! Multi-pass doesn't mean AST.

Building an AST doesn't require just more memory, but more code and development work: more complexity in the code to build the abstraction and to traverse it. It's useful if you need to manipulate or analyze the program in ways that are closely related to the source language. If the language cannot be checked in one pass, you might need one; you wouldn't want to be doing the checking on an intermediate representation, where you've lost the relationship to the code. AST building can be reused for other purposes, like code formatting, refactoring, communicating with an IDE for code completion and whatnot.

If the only thing you're going to do with an AST is walk it up and down to do some checks, and then to generate code, and you do all that in an order that could have been done without the AST (like a bottom-up, left to right traversal), then it was kind of a waste to construct it; those checks and generation could have been done as the phrase structure rules were parsed.

I once read that this is why the MSVC compiler didn't support two-pass template instantiation until very recently: the original compiler implemented templates almost like a macro that re-emitted a stream of tokens with the template parameters replaced.
I can't say if that was a design goal, but it sure looks like it. That's also the way to avoid scaling compiler memory use to program size.

At first I thought that it wasn't possible for C. After I thought about it, as long as you disallow forward references, and rely on a single source file as input, it's possible to compile a complete C program in one pass. Anything else requires a preprocessor (e.g "#include") and/or linker (e.g. "extern" and prototypes) to solve. The implementation in the article dodges all of these and focuses on a very pure subset of C.

I think this goal may also have shifted over time. I remember when I learned C, we used c89, which required declaring all local variables at the top of the block. This seemed like a weird/arbitrary requirement at the time (and is no longer required in later versions), but it makes a lot of sense in a single-pass context! It would allow the stack frame for the current function to be fully sized before any other logic is compiled
I think one thing some early compilers did was read the source serially in one pass and write the output serially in one pass. If you were doing multiple passes you had do that for each pass. That means your compiler speed is IO bound. So one pass is faster.

My cousins ex had a workflow in the late 70's that involved two floppy drives and a little dance to compile and link. Later he got a 5M hard drive which improved things a lot.

using recursive descent, you don't need to build an ast
the call stack during recursive descent is an ephemeral ast, for the recursive descent parsers I've written.
That implies the memory usage is proportional to the depth of the AST, not the total size (which is the point, I think).
Only if the compiler doesn't do anything beyond basic peephole optimizations.