Hacker News new | ask | show | jobs
by nickmqb 2157 days ago
I have a somewhat controversial opinion on this.

This paper, like some others that I've seen, treats parsing as an academic/algorithmic topic. Given some token stream we'd like to minimize some error criteria. However, these papers ignore the fact that people don't write token streams: they write code that is formatted using whitespace. I don't know any programmer that writes their source code on a single line. On the contrary, pretty much every programmer formats their source code to some reasonable (if not complete) degree. In other words: parsing error recovery is not primarly an algorithmic problem: it's a usability problem.

Newlines and indentation are a source of extra information that we can use to infer what the programmer meant. Why throw all that away during tokenization? That's crazy! We can totally use it for error recovery/error message generation.

I decided to play around with this idea in the design my programming language Muon [1]. The language uses redundant significant whitespace for parser error recovery. This simple approach by itself has turned out to work surprisingly well! In my admittedly biased opinion, it frequently surpasses the error recovery quality in mature compilers such as the C# compiler, which has had tons of effort poured into error recovery.

Of course, a purely whitespace based recovery scheme is not perfect: there are rough edges, like having to deal with tabs vs spaces, and recovering inside a line is not usually possible. But the fact that such a simple approach has led to good results makes me think this would be a great area for future research, that can perhaps combine the best of both worlds here.

[1] https://github.com/nickmqb/muon

8 comments

I think you hit the nail right in the head, but I would say that whitespace is only one source of signal. Lately I've become convinced that languages need redundancy in their grammar to help with error recovery. Having a more flexible syntax actually hurts usability because valid but semantically incorrect code can't be caught early enough and makes it much harder to give friendly errors. rustc uses whitespace for error recovery in several places, although the most prevalent is checking whether two different elements are on different lines, like when detecting a `:` when a statement ending `;` was actually meant[1].

[1]: https://play.rust-lang.org/?version=stable&mode=debug&editio...

> languages need redundancy in their grammar

IMO extra redundancy makes Greek much easier to read than Latin. The upfront learning is a bit more effort, but then reading it is much easier. Unlike Latin it has articles (which are declined and agree with the nouns/adjectives), and the endings on verbs & nouns have way less ambiguity. I can certainly see how the same would apply to programming languages.

Somewhat related is my belief that text based languages have long since outlived their usefulness. With text based languages the compiler creates an abstract syntax tree and a bunch of metadata and then throws it all away afterwords.

Oh sure you can save an object file if and ONLY if it didn't change and it's cached locally.

A perfect system if you edit a file and save it 90% of the compilers work would already done and saved as part of the file format. Topical is the parser would usually have access to the previously unbroken file.

In my experience parsing is pretty robust and fast already, to the point where optimizing that stage of the compiler isn't worth it. Regarding the discrepancy between the AST and the source code, I think that it is worth noting that although there's a relationship between the syntax and the semantics, there need not be a 1:1 mapping between these, both allowing different features look similar (because they are pedagogically similar, like in impl Trait in Rust in argument position and return position) or dissimilar. I could see leveraging ASTs for more accurate diffing algorithms and fancy tools, but none of these would require that the AST be the source of truth.
If you have ABI stability you don't need to do any of that stuff either to code that hasn't changed.

The fastest way to compile code is to not constantly recompile it from scratch.

I mean it sounds like you have an issue with batch compilers more than "text based languages." If you didn't toss out the AST you'd have to serialize it (now you're compiling to two targets at once), and when you want to recompile you need to parse the serialized AST along with the source code for 90% of the same information.
Yes, modern query-oriented compilers (such as TypeScript and C# Roslyn) keep the AST in memory and co-operate closely with the editor via the language server protocol.

I recently saw an example of a language that aims to be AST-first rather than notation-first: it serialises to XML! http://mbeddr.com/

Nothing modern about it, lisp has been doing it for fifty years now. S-exprs also serialize to xml quite nicely ;)
In a query-oriented compiler, you can give it a source location and ask for type infomation, or where is the definition of the symbol at that location, and the compiler will do just enough work to give you the answer, using laziness and memoization to make it reasonably efficient. As far as I know, LISP systems don't work that way: they parse and evaluate (and maybe compile) as soon as an expression is entered, without the kind of laziness you see in a query-oriented compiler.

This recent link was pretty good, I think https://ollef.github.io/blog/posts/query-based-compilers.htm... and this discussion with Anders Hejlsberg https://youtu.be/wSdV1M7n4gQ

An even bigger flaw with most academic research into parser error recovery is that the vast majority of syntax errors occur from modifying a valid program to produce an invalid program, but the recovery algorithms are oblivious to this.
Have a look at this writeup by the author of lezer, if you haven't already: https://marijnhaverbeke.nl/blog/lezer.html
tree-sitter is excellent stuff! It's heavily inspired by Tim Wagner's PhD thesis (original site seems to be down, but https://web.archive.org/web/20150919164029/https://www.cs.be... works). IMHO more people should know about that work, and the sequence of work from Susan Graham's lab that led up to it. We have also been heavily inspired by Tim's work and Lukas's thesis extends and updates a number of aspects of that seminal work including, in Chapter 3, error recovery (https://diekmann.co.uk/diekmann_phd.pdf).

All that said, it's surprisingly difficult to compare error recovery in an online parser (i.e. one that's parsing as you type) to a batch parser. In the worst case (e.g. load a file with a syntax error in), online parsers have exactly the same problems as a batch parser; however, once they've built up sufficient context they have different, sometimes more powerful, options available to them (but they also need to be cautious about rewriting the tree too much as that baffles users).

Approaching this from the opposite side, language designers should also take into account how code with slight mistakes (typos and confused use of features) could be detected. Sometimes adding small things to the grammar can pay a lot of dividends when writing the production ready compiler. Alternatively when adding a new feature they should be thinking "what common mistakes will be made to produce worse errors if we introduce this with this syntax".
I think in some cases they do. Python3’s parser can detect when people use the old print syntax for example.
Which suggests that parsing should be done while editing, supporting many other refactoring tools as well. The key feature enabling this facility is incremental parsing.
Pseudo-regular parsing combined with hand-written pratt-parsers is a nice sweet-spot: easy to read, very robust, and produce great error messages. The technique was, I am told, moderately popular in the 80s and 90s. (My familiarity with them comes from a language called Spad.)
It works as long as people actually care about the formatting of their code. I've seen far too many cases where people just don't care or care about the wrong things when it comes to formatting. Like they copy/cut & paste some code and now the indentation is off by a space or a tab, but they don't care. There's also the space vs tabs rabbit hole.
> There's also the space vs tabs rabbit hole.

This one is easily fixable: just emit a parse error if you encounter a space used as an indentation character…

I sort-of do write source code in one long line: I use an autoformatter so when I want to insert something, I usually just write it in one long line and hit save. If I have a syntax error the line won’t get formatted and then I have to stare at it.

That said, I think whitespace-based error recovery is a good idea and would probably still help in my case.

Hey Nick,

Project looks great! I see on your README you're working on an LLVM backend. I'm doing the same for an unrelated language, would love to contribute to Muon if you're looking for extra hands.

> Newlines and indentation are a source of extra information that we can use to infer what the programmer meant.

in the case of python, this is part of the language. :)

Tip: Don't ask visitors to your repo to follow you on Twitter. Some of us don't use Twitter and find this advice rather annoying. Start a blog or something.