by _hardwaregeek 2171 days ago
This article is bugging me for some reason. I don't disagree with it. It's certainly easier to make a language now, even more so than in 2016 when this article was published. But it feels like saying "it's easier to make a bridge than ever". That statement isn't wrong. I'd much rather make a bridge with modern technology. But it's still a damn hard task.

And citing parsing isn't a great example. Parser generators have been around for ages. And they're usually not the hard part anyways. Defining a simple grammar and parsing it, even manually, isn't that terrible of a task. Getting decent error messages and figuring out recovery? That's trickier.
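To make the recovery point concrete, here is a minimal sketch of panic-mode recovery in a hand-written parser. The toy grammar (`let NAME = NUM;`) and the names `tokenize` and `parse_program` are made up for illustration; nothing here comes from the article or from any real compiler. On an error it records a message and skips to the next `;`, so later statements still get checked.

```python
import re

# Toy lexer: numbers, identifiers, and single-character operators.
TOKEN = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|(.))")

def tokenize(src):
    """Split source into (kind, text) pairs; kinds are 'num', 'name', 'op'."""
    tokens = []
    for num, name, op in TOKEN.findall(src):
        if num:
            tokens.append(("num", num))
        elif name:
            tokens.append(("name", name))
        elif op.strip():
            tokens.append(("op", op))
    return tokens

def parse_program(src):
    """Parse statements of the hypothetical form `let NAME = NUM;`.

    Returns (bindings, errors). Errors do not stop the parse.
    """
    tokens = tokenize(src)
    pos = 0
    bindings, errors = {}, []

    def expect(kind, text=None):
        nonlocal pos
        if pos < len(tokens):
            k, t = tokens[pos]
            if k == kind and (text is None or t == text):
                pos += 1
                return t
            found = repr(t)
        else:
            found = "end of input"
        wanted = repr(text) if text else kind
        raise SyntaxError(f"token {pos}: expected {wanted}, found {found}")

    while pos < len(tokens):
        try:
            expect("name", "let")
            name = expect("name")
            expect("op", "=")
            value = expect("num")
            expect("op", ";")
            bindings[name] = int(value)
        except SyntaxError as err:
            errors.append(str(err))
            # Synchronize: discard tokens up to and including the next ';'.
            while pos < len(tokens) and tokens[pos] != ("op", ";"):
                pos += 1
            pos += 1
    return bindings, errors
```

Calling `parse_program("let a = 1; let = 2; let c = 3;")` reports one error for the second statement and still binds `a` and `c`. Even this much leaves the hard parts open: good messages, source positions, and avoiding cascades of bogus follow-on errors.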

Code generation has certainly gotten easier. But you still need to go through the process of figuring out how to lower your abstractions. My language is still extremely basic but I've still had to map my high level types and control structures down to WebAssembly. LLVM won't do that for you.
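As a rough illustration of what "lowering" means here, this sketch turns a tiny tuple-based AST into folded WebAssembly text. The AST shape and the `lower` function are hypothetical, not taken from my language or any real toolchain, and everything is assumed to be an `i32` local. The point is that a `while` loop has no direct equivalent: you have to spell it out as a `block`, a `loop`, and explicit branches yourself.

```python
def lower(node, depth=0):
    """Lower a tuple-based AST node to folded WebAssembly text (i32 only).

    `depth` is only used to give nested loops distinct label names.
    """
    kind = node[0]
    if kind == "num":
        return f"(i32.const {node[1]})"
    if kind == "get":
        return f"(local.get ${node[1]})"
    if kind == "set":
        return f"(local.set ${node[1]} {lower(node[2], depth)})"
    if kind in ("add", "lt"):
        op = {"add": "i32.add", "lt": "i32.lt_s"}[kind]
        return f"({op} {lower(node[1], depth)} {lower(node[2], depth)})"
    if kind == "while":
        cond, body = node[1], node[2]
        inner = " ".join(lower(stmt, depth + 1) for stmt in body)
        # Exit the outer block when the condition is false,
        # otherwise run the body and branch back to the loop header.
        return (f"(block $exit{depth} (loop $top{depth} "
                f"(br_if $exit{depth} (i32.eqz {lower(cond, depth + 1)})) "
                f"{inner} (br $top{depth})))")
    raise ValueError(f"unknown node kind: {kind!r}")

# while (i < 10) { i = i + 1 }
loop_ast = ("while",
            ("lt", ("get", "i"), ("num", 10)),
            [("set", "i", ("add", ("get", "i"), ("num", 1)))])
wat = lower(loop_ast)
```

A real language also has to decide how structs, strings, closures and so on map onto linear memory and a handful of numeric types, which is where most of the work goes.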

There's also more that your average user expects if you want a language that people use. Decent tooling is important, which means a language server and syntax highlighting packages for different editors. Good error messages. Decent type inference. Most of these you can eschew in the first few iterations of your language, but eventually you'll need them.

I feel bad criticizing this post because writing a language has been one of the most instructive experiences I've had. I've learned so much about code generation, typechecking, the WASM spec, etc. But it's still a lot of tough work to get to something people can use. I'm not sure parser generators and LLVM make it that much easier.

4 comments

A compiler is an error reporting tool with a code generation side-gig.
I'm not sure that's true in reality - the code generation is what you ship in practice. You can't skimp on that feature.
I am being flippant, but what the user of the compiler uses it for the vast majority of the time is to get feedback on whether they've missed something. The vast majority of the code in the Rust compiler is dedicated to error reporting. Efficiency of the generated code is important, but it is a well explored space. Emitting good diagnostics is hard and unless you make it a priority it will always be subpar. Communicating with the user to help them get to that generated binary is at least as important, and in my eyes it is more important.

I also consider parsing and syntax to be the least interesting part of any language, as it is also a solved problem that requires little effort (modulo malformed code recovery) in both design and implementation when put in contrast with the rest of the compiler's functions and language design space.

I've started down the road of language design recently and have committed to roughly these goals:

1. Making "Algol 2020" is only interesting if you also throw person-decades of effort into making it competitive in production. I want to do other things with my time, therefore anything I attempt should not lead to that, which produces these other requirements to create a small yet useful language:

2. The implementation must leverage existing languages in a way that is uncomplicated to the user. Which means that it either compiles to some form of interpreter or to generated source code.

3. The language must focus on fully leveraging a specific data structure or family of data structures. What is and has long been fashionable in PL discussion is to elaborate upon symbolic expression. You do need some symbol definition to have a language, but the preferred orientation we use for many data structures is spatial: "top of stack", "bottom of tree", "traverse the graph", "loop over the array". Engaging with the vocabulary of the data in its context, and simply working to generalize upon that vocabulary and the bookkeeping it needs (iteration counters, selection markers, error cases, etc.), rather than the generalities of algorithm definition, leads directly towards a tighter language. We can define many algorithms very well, and we're paying more attention to concurrency lately, but the software we're writing still mostly isn't about algorithms themselves. You don't end up writing one million lines of code because you have a mega-algorithm that's just really hard to express.

Consider regular expressions: a little string matching language, which can be usefully explained in a page or two. The idea of them has been around since the 1950s, yet all the hip, popular languages today have implemented some syntax for them, making regex de facto one of the most common and long-lived programming languages in existence, outdoing "big" languages by many measures.

And ideally I'd like to engage in those terms: a language so small you don't really notice except to think "gee, that's handy."
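To illustrate the regular-expression point above, here is a small sketch using Python's standard `re` module. The date pattern and the `find_dates` helper are just an example I picked, not anything from the article. The whole "program" in the embedded language is the one pattern string; the host language only supplies the plumbing around it.

```python
import re

# Example pattern (an assumption for illustration): dates like 2016-07-04.
DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

def find_dates(text):
    """Return (year, month, day) integer triples for every date-like match."""
    return [tuple(int(part) for part in m.groups())
            for m in DATE.finditer(text)]

dates = find_dates("Published 2016-07-04, discussed 2020-05-21.")
```

Note that the pattern only checks the shape of the text, so something like `2020-99-99` would also match. That narrowness is part of why the language stays small.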

Writing a competitive concurrent garbage collector is probably the hardest part these days.
Well, you pretty much can't. A lone hobbyist simply isn't going to throw together a state-of-the-art garbage collector, any more than they're going to outperform LLVM's optimisations and code-generation. The choice is between leveraging existing GCs, such as by compiling to Java bytecode, or making do with an inferior garbage collector (or the choice of several inferior garbage collectors), the way D/Nim/OCaml do.
I on the other hand thought of this:

https://www.explainxkcd.com/wiki/index.php/2309:_X

(I'm not trying to troll the discussion; it's just that XKCD is where my mind runs on some trigger words.)

Just because you can, doesn't mean you should. (On the other hand, just because you can't doesn't mean you shouldn't, and neither does just because you can. They're orthogonal things basically.)