| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by madgar 3454 days ago

Technique as old as dirt. My thesis advisor used it in the 60s and 70s.

His favored approach had one artificial limitation, perhaps a remnant of the age: he limited the size of source files in bytes to some power of 2. This let him represent each token (incl. white space/new lines/comments) as a pair of fixed-size integers indexing into the file bytes. The tokenized file is an array of those pairs and the concrete syntax tree is a tree where leaf nodes are indices into the array of tokens.

Suitable for syntax-directed code generation, control-flow graph generation, static analysis, linting, pretty-printing. Super memory compact and even has an upper bound on memory footprint. The caveat is that the source language does need to support composing a "module" out of multiple files because of the limit on source file size.

1 comments

chubot 3454 days ago

I'm sure it's as old as dirt, but I don't think there is a good name for it. "Concrete Syntax Tree" is not a good name for the reasons pointed out in the article.

Do you have reference for this? I saw this problem mentioned in the write-up on the ZINC Abstract Machine by Xavier Leroy (author of OCaml). But I don't know of other papers that talk about this.

Also I believe that most open source tools do NOT have this functionality. Look at lib2to3. It's bolted on -- not exactly a clean design. Most open source front ends are not designed for tooling like Clang is.

link

tacostakohashi 3454 days ago

The ANTLR folks call it a "parse tree".

link

chubot 3454 days ago

Please read the article. I specifically mention ANTLR, parse trees, and why the lossless syntax is not a parse tree / concrete syntax tree.

link

madgar 3454 days ago

My source is in-person conversations with a real-life human being, and working on one of his codebases that employed the technique. If you want to look up his work, his name is Bill McKeeman. I personally have never felt compelled to find secondary sources when I had primary sources.

Edit: I'm personally not surprised that open-source codebases don't employ this technique. Lots of great PL work was done for private companies until the 90s and while lots of work was published in papers and books, precious few open-source PL communities historically drew from academia. I'm sure you know the counter-examples.

link