Hacker News new | ask | show | jobs
by fizixer 3849 days ago
There is a problem but you're looking at it the wrong way.

What's in the compiler's machine code generation phase that the build system needs to know about? If nothing, then making a monolithic system is only going to make your life miserable.

Well-designed compilers are already split into (at least) two subsystems: frontend and backend. Frontend takes the program and spits an AST (very roughly speakign, although there is also the semantic-analysis phase, and intermediate-representation-generation phase). Backend is concerned with code generation. What your build systems needs is just the frontend part. Not only your build system, but also an IDE can benefit greatly from a frontend (which as one of the commenter pointed out, results in wasteful duplication of effort when and IDE writer decides to roll his/her own language parser embedded in the tool).

I think AST and semantic-analyzer are going to play an increasing role in a variety of software development activities and it's a folly to keep them hidden inside the compiler like a forbidden fruit.

And that's the exact opposite of monolith. It's more fragmentation, and splitting the compiler into useful and resusable pieces.

(At this point I have a tendency to gravitate towards recommending LLVM. Unfortunately I think it's a needlessly complicated project, not the least because of being written in a needlessly complicated language. But if you're okay with that it might be of assistance to your problems)

2 comments

I think it is uncontroversial that many tools would benefit from access to the frontend. But what is the interface to the frontend? There is tremendous diversity among frontends; it's the key distinguishing feature of most languages. What exactly are you going to expose? How should the build system interact with this?

LLVM is a great example of a modular compiler which is a pain to program against, because it is being constantly being refactored with BC-breaking changes. As silvas has said, "LLVM is loosely coupled from a software architecture standpoint, but very tightly coupled from a development standpoint". <https://lwn.net/Articles/583271/> In contrast, I can generally expect a command mode which dumps out Makefile-formatted dependencies to be stable across versions. Modularity is not useful for external developers without stability!

> What exactly are you going to expose? How should the build system interact with this?

The case of AST is relatively clear. You create a library module, or a standalone program, that takes source code and generates an AST. The AST could be in JSON format, s-expression (that would be my choice), heck even XML. It's better if the schema adheres to some standard (I think LLVM AST can be studied for inspiration, or even taken as standard) but even if it's not, that's a problem orthogonal to that of monolithic vs modular. Once you have an source-to-AST converter, it can be used inside a compiler, text-editor, build system, or something else. These are all "clients" of the source-to-AST module.

I'm not too sure about semantic-analysis since I'm studying it myself at the moment. All I can say at the moment (without expert knowledge of formal semantics) that once AST is used in more and more tools, semantic analysis would follow, and hopefully conventions would emerge for treating it in a canonical fashion. Short of that, every "AST client" can roll their own ad-hoc semantic-analyzer built-into the tool itself. Note that it would still be way more modular than a monolithic design.

> I think AST and semantic-analyzer are going to play an increasing role in a variety of software development activities

It would be fantastic if source control systems would work on the AST instead of the plain text files, so many annoying problems could be solved there.

There are simpler and less intrusive ways to solve outside-AST issues (like an automatic reformatting pass before each build or each diff)

What you're suggesting raises a bunch of new (non-trivial) issues:

- What would you do with code comments? Things like "f(/+old_value+/new_value)".

- How to store code before preprocessing (C and C++) ?

- How to store files mixing several languages (PHP, HTML, Javascript) ?

- How do you store code for a DSL?

> - What would you do with code comments? Things like "f(/+old_value+/new_value)".

Comments are included in the AST, the AST should be reprojectable into canonical plaint text.

> - How to store code before preprocessing (C and C++) ?

This could get tricky, punt. cdata

> - How to store files mixing several languages (PHP, HTML, Javascript) ?

Same file format, different semantics. PHP is a DSL.

My new language manifesto includes having a mandatory publicly defined AST.

If it is reprojectable into canonical plain text, it's not really an AST - just an ST.
If version control stored refactorings (tree operations) against the AST it would open up a whole new world of possibilities.
What kinds of problems you talking about.
One really obvious one is merge conflicts in the following pattern:

    #old_file.py
    ...
    def func_before(*params):
        do_things()


    def func_after(*params):
        do_other_things()

    ...
If someone adds a function between func_before and func_after, and their coworker adds a different function also between func_before and func_after, you get a merge conflict because the line-based VCS doesn't know how the new functions should be ordered (or, potentially, interleaved ;) ). An AST-aware version control system could[1] realize that the order of function definitions is meaningless, and then wouldn't need to ask for help resolving the conflict.

[1] This isn't always true, so this might still involve some degree of being specialized-to-the-language. Or it might just mean that the AST the VCS worked with allows for fairly complicated types, and can distinguish between order-sensitive things and order-insensitive things.

Function definition order is certainly meaningful, at least to a human reader. Suppose co-worker 1's function is related to func_before, and co-worker 2's function is related to func_after, and the VCS flipped them around.

Or alternatively, I go in by myself and re-arrange the order of functions in a file to improve the clustering. If the VCS thinks the order of functions is meaningless, it won't recognize the change.