1. I started with a primitive lisp interpreter written in C++ and worked hard on exposing C++ functions/classes to my lisp using C++ template programming. LLVM is a C++ library, and the C bindings are always behind the C++ API, so exposing the C++ API directly gave me access to the latest and greatest API. That means you need to keep up with LLVM - but clang helps a lot because API changes appear as clang C++ compile time errors. I've been "chasing the LLVM dragon" (cough - keeping up with the LLVM API) from version 3.something to the upcoming 13.
2. I wrote a Common Lisp compiler in my primitive lisp that converted Common Lisp straight into LLVM-IR. I didn't want to develop my own language - who's got time for that? So I just picked a powerful one (Common Lisp) with macros, classes, generic functions, existing libraries, a community etc.
3. I used alloca/stack allocated variables everywhere and let mem2reg optimize what it could to registers. I exposed and used the llvm::IRBuilder class that makes generating IR a lot easier.
4. Then I picked an experimental, developing compiler, "Cleavir", written by Robert Strandh, and bootstrapped it with my Common Lisp compiler. It's like that movie "Inception" - but it makes sense :-).
Now we have a Common Lisp programming environment that interoperates with C++ at a very deep level. Common Lisp stack frames intermingle perfectly with C++ stack frames and we can use all the C/C++ development, debugging and profiling tools.
This Common Lisp programming environment supports "Cando", a computational chemistry programming environment for developing advanced therapeutics and diagnostic molecules.
We are looking for people who want to work with us - if interested and you have a somewhat suitable background - drop me a message at info@thirdlaw.tech
I'm not an LLVM expert, but I don't agree with some of your arguments.
> My side of the code generator had to recognize when a variable had already been defined and keep track of its pointer
For a human, it's natural to write code that references each variable by its name. For a compiler, however, referencing a variable by its string name is error-prone and inefficient (think about shadowing, for example). The natural way to reference an entity is by its object pointer, which is what LLVM does. This is especially true considering LLVM is designed to perform various complex transformations.
> There is a pass called mem2reg that will convert to SSA, but it needs you to allocate and store variables in memory (instead of in registers).
The purpose of mem2reg is to make your job easier. It's weird to say that it "needs" you to allocate allocas for your variables: that's what it allows you to do (for your own convenience). If you prefer to generate PHI nodes directly, you can just do so.
> LLVM IR has opinions about variable scope
Not sure what you are referring to. LLVM only has 'alloca', which knows nothing about "scope". It must be defined before being referenced -- but this is true for everything in SSA.
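To make the shadowing point concrete, here is a minimal sketch of how a frontend might resolve a name to an object once and then work with the object from there on. The `Variable` and `Scopes` names are made up for illustration and are not part of LLVM; in a real frontend the stored object would typically be the pointer returned by the alloca.

```python
# Hypothetical frontend symbol table: names are resolved to objects once,
# and everything after that refers to the object, never to the string.
class Variable:
    """One storage slot; identity, not the name, tells variables apart."""

    def __init__(self, name):
        self.name = name


class Scopes:
    def __init__(self):
        self.stack = [{}]  # innermost scope is last

    def push(self):
        self.stack.append({})

    def pop(self):
        self.stack.pop()

    def declare(self, name):
        variable = Variable(name)
        self.stack[-1][name] = variable
        return variable

    def lookup(self, name):
        # Search from the innermost scope outwards so inner names shadow outer ones.
        for scope in reversed(self.stack):
            if name in scope:
                return scope[name]
        raise NameError(f"undefined variable: {name}")


scopes = Scopes()
outer_x = scopes.declare("x")
scopes.push()
inner_x = scopes.declare("x")          # shadows the outer x
resolved_inside = scopes.lookup("x")   # the inner object
scopes.pop()
resolved_outside = scopes.lookup("x")  # the outer object again
```

Both variables are called "x", but they are two different objects, so later stages that hold the object cannot confuse them.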
I've also gone down the lex/parse/LLVM rabbit hole. The OP didn't write that LLVM is clueless or unstructured; she wrote that LLVM could have a better user manual. C++ is my meal ticket; LLVM can certainly do nice stuff.
I definitely agree that it would be better if LLVM had a gentler learning curve and a more accessible manual.
I'm only pointing out that many "problems" listed in the post are intentional design choices for good reasons. They are not downsides that should be improved.
> I'm only pointing out that many "problems" listed in the post are intentional design choices for good reasons. They are not downsides that should be improved.
That's not really less of a problem. If you can't tell why the designers made a choice and what the purpose and intention was, and there's no documentation about that, then your project has failed pretty catastrophically on communication. That's not really a less severe failure or an easier to fix problem than failing technically. Although, I suspect a lot of developers would reflexively disagree.
I agree that it's very important to clearly communicate the design to the users. However, I feel that it's hard in practice for many reasons. For example, it's hard to argue in the documentation why something is not done (e.g. why entities are referenced by pointer instead of by string name). Also keep in mind that the documentation is read by people with varying degrees of expertise. It's hard to make all of them happy.
In fact, when I first started using LLVM, I created a basic block and put everything into it. Then LLVM complained, and I learnt that a basic block must be a list of straight-line operations followed by a terminator such as a branch. At that moment I felt much like the author of this post: "Why is there such a bizarre rule that makes it harder for me to write my program?" But after I was forced to rewrite my code to conform to this rule, I was surprised to find my program logic much easier to understand. And as I gradually learned a bit more about LLVM, I understood that the basic block rule is there for other very good reasons as well.
So is this good design? Clearly it is. This design decision not only helps with the library itself, but also forces the users to write code in a less error-prone way.
However, is it possible to justify this in the documentation, so another user won't have to go through my initial frustration? I am doubtful. My personal feeling, at least, is that I wouldn't have understood why it is designed this way unless I had actually played around and experienced it myself.
I hope this clarifies my point on why sometimes it is hard to communicate design philosophies.
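For anyone who hasn't hit the basic block rule yet, here is a rough sketch of what it amounts to. This is a simplified model I'm making up for illustration, not LLVM's API: `BasicBlock` and `unterminated` are hypothetical names, instructions are plain strings, and the terminator set is only a subset of LLVM's.

```python
# Hypothetical, simplified model of the basic block rule; not LLVM's API.
TERMINATORS = {"ret", "br", "switch", "unreachable"}  # subset of LLVM's terminators


class BasicBlock:
    def __init__(self, label):
        self.label = label
        self.instructions = []

    def is_terminated(self):
        """True once the last instruction is a terminator."""
        if not self.instructions:
            return False
        opcode = self.instructions[-1].split()[0]
        return opcode in TERMINATORS

    def append(self, instruction):
        """Add an instruction, refusing anything after the terminator."""
        if self.is_terminated():
            raise ValueError(f"block '{self.label}' already ends in a terminator")
        self.instructions.append(instruction)


def unterminated(blocks):
    """Return the labels of blocks that do not end in a terminator."""
    return [block.label for block in blocks if not block.is_terminated()]


entry = BasicBlock("entry")
entry.append("%sum = add i32 1, 2")
entry.append("br label %exit")

exit_block = BasicBlock("exit")
exit_block.append("%twice = mul i32 %sum, 2")

problems = unterminated([entry, exit_block])  # exit_block has no terminator yet
exit_block.append("ret i32 %twice")
```

Once control flow can only leave a block at its last instruction, "put everything in one block" stops being expressible, which is roughly what forced me to restructure my generator.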
> Each of the major steps have tools available that will do 90% of the work for you. On the lexer/parser side there’s ANTLR4, bison, yacc, flex.
In my experience, those tools will do like 10% of the work of lexing or parsing for you, and you will spend equivalent to 20% of the work understanding how to use them and integrating them. And then you'll find out a sad hand-written recursive descent parser is faster in practice and is what e.g. GCC and clang use.
It's true that a hand-written recursive parser is better for a real compiler, but the main reason is that it's much easier to write sensible error messages that way. Parser generators are only really good at the happy path.
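As a rough illustration of the error message point, here is a minimal hand-written recursive descent parser for integer arithmetic. The grammar, the `Parser` class and the message wording are all made up for this sketch; the thing to notice is that each parsing function knows exactly what it expected and where, so it can say so.

```python
import re


class ParseError(Exception):
    pass


def tokenize(text):
    """Return (kind, value, column) tuples; kind is 'num', 'op' or 'eof'."""
    tokens = []
    for match in re.finditer(r"(\d+)|\S", text):
        kind = "num" if match.group(1) is not None else "op"
        tokens.append((kind, match.group(), match.start() + 1))
    tokens.append(("eof", "", len(text) + 1))
    return tokens


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.index = 0

    def peek(self):
        return self.tokens[self.index]

    def advance(self):
        token = self.tokens[self.index]
        self.index += 1
        return token

    def parse(self):
        value = self.expr()
        kind, found, column = self.peek()
        if kind != "eof":
            raise ParseError(f"column {column}: unexpected {found!r} after expression")
        return value

    def expr(self):
        # expr := term (('+' | '-') term)*
        value = self.term()
        while self.peek()[1] in ("+", "-"):
            operator = self.advance()[1]
            right = self.term()
            value = value + right if operator == "+" else value - right
        return value

    def term(self):
        # term := factor ('*' factor)*
        value = self.factor()
        while self.peek()[1] == "*":
            self.advance()
            value *= self.factor()
        return value

    def factor(self):
        # factor := NUMBER | '(' expr ')'
        kind, found, column = self.peek()
        if kind == "num":
            self.advance()
            return int(found)
        if found == "(":
            self.advance()
            value = self.expr()
            kind, found, column = self.peek()
            if found != ")":
                shown = found or "end of input"
                raise ParseError(f"column {column}: expected ')' but found {shown!r}")
            self.advance()
            return value
        shown = found or "end of input"
        raise ParseError(f"column {column}: expected a number or '(' but found {shown!r}")


result = Parser("2 + 3 * (4 - 1)").parse()
```

Feeding it `2 + * 3` raises a `ParseError` that points at column 5 and says a number or `(` was expected, which is the kind of message that takes extra work to get out of a generated parser.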
The article aptly describes what it's like to start working with essentially any large code project, be it open source or proprietary. Unfortunately you are never going to see a project with comprehensive documentation, I'm not sure what that would even look like.
Good, maintainable code therefore becomes an important part of the documentation and being realistic about this from the start probably improves your "reference" or "manual", where it's better to focus on high-level or architectural concepts, and link to where to find the nitty-gritty in source.
>Unfortunately you are never going to see a project with comprehensive documentation, I'm not sure what that would even look like.
It probably would look like TeX, FWIW.
Edit: I move to the bookcase to look at TeX: The Program. There it is sitting next to my volumes of TAOCP. A reminder of my failures. I can almost feel the disapproving gaze of D. E. Knuth. I'm not worthy.
I empathize with the author's struggle and the pain of having to use the C++ API to generate LLVM IR. It's not relevant to Go, but the OCaml LLVM bindings are kept up-to-date and the documentation is there, though there's very little tutorial material to be found. Still, I find it much cleaner and nicer to use than C++.
Trying to generate LLVM IR from scratch seems like a lost cause; when you realize how much the library code is keeping track of for you to make it possible to emit correct LLVM, you know that replicating all that just isn't worth it.
Also found LLVM to be pretty poorly documented - often what's out there is out of date and incomplete. The sheer scale of it makes it hard to narrow down what you're looking for too - I've resorted to searching the source code a few times to see how something works.
I love its multi-language, multi hardware target abilities, and wicked fast compiled code - but its complexity, glacially slow compiles, and sloppy documentation are currently a drag.
The documentation was a joke the last time I tried to use it. The author should have tried libgccjit instead. It lacks LLVM's full capabilities, but at least it is possible to get a grasp of the whole API and read it fully.
I had a not terrible time emitting LLVM IR text directly as part of an exploration of language backends.
Here are the three parts:
* Introduction: https://notes.eatonphil.com/compiler-basics-llvm.html
* Conditionals: https://notes.eatonphil.com/compiler-basics-llvm-conditionals.html
* And system calls: https://notes.eatonphil.com/compiler-basics-llvm-system-calls.html
The hardest part I can remember is figuring out how LLVM IR's embedded assembly works since it's not exactly like Clang or GCC's IIRC. And the documentation was definitely confusing.
I think the libraries wrapping LLVM IR are frankly harder to figure out than emitting the IR text directly.
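To give a flavour of what emitting the text directly looks like, here is a minimal sketch (not the code from the posts above). It turns a nested-tuple arithmetic expression into a textual `main` function; the `Emitter` class, the tuple format and the `%tN` naming are assumptions made for this example.

```python
# Minimal sketch: emit textual LLVM IR for integer arithmetic.
# Expressions are ints or (operator, left, right) tuples; all values are i32.
OPCODES = {"+": "add", "-": "sub", "*": "mul"}


class Emitter:
    def __init__(self):
        self.lines = []
        self.counter = 0

    def fresh(self):
        """Return a new SSA value name such as %t0, %t1, ..."""
        name = f"%t{self.counter}"
        self.counter += 1
        return name

    def compile(self, node):
        """Emit instructions for node and return the operand that holds its value."""
        if isinstance(node, int):
            return str(node)  # constants are used inline as operands
        operator, left, right = node
        lhs = self.compile(left)
        rhs = self.compile(right)
        result = self.fresh()
        self.lines.append(f"  {result} = {OPCODES[operator]} i32 {lhs}, {rhs}")
        return result


def emit_main(expression):
    """Wrap the compiled expression in a main function that returns its value."""
    emitter = Emitter()
    result = emitter.compile(expression)
    body = "\n".join(emitter.lines + [f"  ret i32 {result}"])
    return f"define i32 @main() {{\nentry:\n{body}\n}}\n"


ir_text = emit_main(("*", ("+", 2, 3), 4))
```

The resulting string can be written to a `.ll` file and handed to the LLVM tools. Every temporary is assigned exactly once, so this tiny subset is already in SSA form without any help from mem2reg.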
The idea of a "nice" language for SMT solvers is really cool!
Off the top of my head, I think Boogie [0] is pretty similar to OP's plan of building a language that reduces to an SMT query, so it may be worth checking out.
However, I'm not sure that LLVM optimizations will really "allow people to write more robust and complex models". I mean, they might help, but they might also make things worse. LLVM tries to make code run fast on a von Neumann architecture, sacrificing the structure of the input program in the process. But for an SMT solver, preserving structure may be better than optimizing, for a number of reasons (basically because solver heuristics may find a solution faster given such additional information).
I’ve recently started down the rabbit hole of building a video player + editor from scratch and this feels so relatable!
Lots of stumbling around, reading scarce and outdated resources, and finding that really not many people have written about this stuff and it’s “easier”/necessary to dive into the source of projects that do similar things. I spent a solid day mapping out how ffplay.c works to try to figure out how to synchronize audio and video properly. I have no background in video but I’m learning as I go, and things are falling into place, and it’s been pretty fun most of the time.
But I definitely resonate with the feeling that, if/when I get this thing working, I won’t really know if it’s “correct”, and I also won’t know how much that affects anything. It’s like one of those infinitely zoomable fractal images, there’s always some higher level of detail than the one you can currently see!
The author bravely tries to figure out some LLVM IR concepts, getting some of them right and some wrong (like the purpose of mem2reg). While I do not want to discourage this sort of exploratory learning, I want to point out that what he is clearly lacking is some CS fundamentals. Perhaps taking some compiler construction classes from here https://www.classcentral.com/search?q=compiler could have made learning LLVM easier.
I also second the good point about LLVM examples and documentation being heavy on the C/C++ API. I was also generating IR code from another language and found this C/C++ API focus annoying.
1. The LLVM API is designed as a C++ API, and if you're serious about using LLVM, you're likely to have to actually work with the C++ API directly. There's a C API which is theoretically more stable than the C++ API, but it is very heavily gimped--it has basically no support for metadata, for example--and is feasible only for the most basic usage. Since the author brings up needing to use custom metadata, that suggests that they are intending to create custom optimization passes, which is basically impossible except via the C++ API.
2. The complaint about metadata was very strange to me. I have had to work with custom metadata very recently with my work in LLVM, and I've had nothing like the pain the author suggests. (I've also had to deal with TBAA, which is definitely an area where LLVM lacks sorely in documentation, particularly examples). The "defined before use" just simply isn't an issue, because metadata is supposed to be global, so there is no define or use...
I took a look at the llir library the author was using. On a quick inspection, it appears to be a library for generating textual LLVM IR without having to link to LLVM at all. Oy. The problem isn't LLVM, nor even the LLVM IR itself. The problem is your library to generate LLVM IR.
3. About the SSA issue. LLVM actually does have facilities to generate SSA correctly without going through allocas (though that might be challenging to use for codegen instead of in the context of an optimization pass). But, as established above, the author is purposefully using LLVM in a way that precludes them from availing themselves of this feature. Note that LLVM specifically recommends that frontends generate variables as allocas in the entry block and letting the optimizer generate the SSA for you (see https://llvm.org/docs/tutorial/MyFirstLanguageFrontend/LangI...).
4. I'm not entirely sure what the author means when discussing variable scope, but my guess is they neglected the "in the entry block" part of the standard guidelines for generating variables. If true, I'm left scratching my head as to where they got an answer to the SSA issue that didn't mention that part--it's a very important part of generating allocas correctly, and getting it wrong means you have a very broken mental model of how it's supposed to work.
5. From the final paragraph, it seems the author's final step is to... write a parser for LLVM IR, and then convert their custom-parsed LLVM IR into SMTLIB2 code. As opposed to having LLVM parse the IR itself, visiting that IR, and then doing the same. Just... no.
This isn't to say that LLVM is perfect in terms of documentation--it is very far from it--but a lot of the issues seem to be related to trying to actively avoid working with LLVM itself.
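To illustrate the "allocas in the entry block" guideline from points 3 and 4, here is a rough sketch of a text-emitting builder that hoists every variable's alloca to the top of the entry block, no matter which block is current when the variable is declared. `FunctionBuilder` and its methods are hypothetical names for this example, and the `ptr` type assumes a recent LLVM release with opaque pointers (older releases would spell it `i32*`).

```python
# Hypothetical text-emitting builder; not part of LLVM or any real library.
class FunctionBuilder:
    def __init__(self, name, params):
        self.name = name
        self.params = params
        self.allocas = []             # always rendered first in the entry block
        self.blocks = {"entry": []}
        self.order = ["entry"]
        self.current = "entry"

    def start_block(self, label):
        self.blocks[label] = []
        self.order.append(label)
        self.current = label

    def emit(self, instruction):
        self.blocks[self.current].append(instruction)

    def declare_var(self, name):
        """Reserve a stack slot in the entry block and return its pointer name."""
        slot = f"%{name}.addr"
        self.allocas.append(f"{slot} = alloca i32")
        return slot

    def render(self):
        lines = [f"define i32 @{self.name}({self.params}) {{"]
        for label in self.order:
            lines.append(f"{label}:")
            if label == "entry":
                lines += [f"  {alloca}" for alloca in self.allocas]
            lines += [f"  {instruction}" for instruction in self.blocks[label]]
        lines.append("}")
        return "\n".join(lines)


# Roughly: int pick(bool cond) { int r = 0; if (cond) { int t = 7; r = t; } return r; }
builder = FunctionBuilder("pick", "i1 %cond")
r = builder.declare_var("r")
builder.emit(f"store i32 0, ptr {r}")
builder.emit("br i1 %cond, label %then, label %done")

builder.start_block("then")
t = builder.declare_var("t")          # declared in a nested scope, still hoisted
builder.emit(f"store i32 7, ptr {t}")
builder.emit(f"%t.val = load i32, ptr {t}")
builder.emit(f"store i32 %t.val, ptr {r}")
builder.emit("br label %done")

builder.start_block("done")
builder.emit(f"%r.val = load i32, ptr {r}")
builder.emit("ret i32 %r.val")

ir_text = builder.render()
```

The frontend never writes a PHI node here. Each variable is just a stack slot with loads and stores, and the merge of `r` at `done` is left for mem2reg to turn into a PHI.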
From what I'm seeing, the author skipped reading the documented LLVM source code in favor of reading a completely undocumented Go port (reimplementation?) of one part of LLVM?
They also seem to have misunderstood the level of abstraction LLVM's IR provides.
Did they miss the forest for the trees? I'd like to think I'm wrong. :/