|
|
|
|
|
by madgar
3407 days ago
|
|
Technique as old as dirt. My thesis advisor used it in the 60s and 70s. His favored approach had one artificial limitation, perhaps a remnant of the age: he limited the size of source files in bytes to some power of 2. This let him represent each token (incl. white space/new lines/comments) as a pair of fixed-size integers indexing into the file bytes. The tokenized file is an array of those pairs and the concrete syntax tree is a tree where leaf nodes are indices into the array of tokens. Suitable for syntax-directed code generation, control-flow graph generation, static analysis, linting, pretty-printing. Super memory compact and even has an upper bound on memory footprint. The caveat is that the source language does need to support composing a "module" out of multiple files because of the limit on source file size. |
|
Do you have reference for this? I saw this problem mentioned in the write-up on the ZINC Abstract Machine by Xavier Leroy (author of OCaml). But I don't know of other papers that talk about this.
Also I believe that most open source tools do NOT have this functionality. Look at lib2to3. It's bolted on -- not exactly a clean design. Most open source front ends are not designed for tooling like Clang is.