| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by Karliss 732 days ago

In real world things don't fit neatly in the boxes established by textbook definitions. The borders between various concepts are fuzzy, and in production implementations for practical purposes like performance, code simplicity, better error reporting and exact usecase different stages of parsing and the parsed representations can be expanded, narrowed or skipped.

A lot of time you will have a syntax tree. It might have preserved some of the concrete syntax details like subexpression ranges for error reporting and IDE functionality or somewhat optional nodes, but at the same time it might also contain information obtained form semantic analysis (which from computer science perspective isn't even a concern of parser), it might not even be a tree. And all of that potentially produced in single pass. Is it a concrete syntax tree, is it abstract syntax tree, is it a thing after AST? Just because a data structure didn't throw away all concrete syntax details, doesn't mean it contains all of them. From my experience in such situations it's more likely to be called AST, as it's closer to that than concrete syntax tree.

It also depends how you define the language that you are interested in (even if it's the same language). The purpose of parsing isn't necessarily always code compilation. Formal grammars and parsers can be used for all kind of things like defining a file format, text markup, and also tasks like manipulation of source code and checking it for errors. Typical example of details not included in AST is parentheses. But that doesn't mean AST can never contain node for parentheses. If they have a meaning for task you are trying to achieve nothing prevents from assigning a node within AST tree. For example both Clang and GCC in some situations will give a warning depending on presence of parentheses even though they are meaningless based on C++ syntax. If you define comments as part of the language then they can be kept and manipulated within AST.

CST doesn't really guarantee that you won't lose any information. The parsers don't operate on bytes they operate on abstract symbols, which might directly correspond to bytes in the file but not always. Again in real world systems what you actually have is 2-3 languages stacked on top of each other. Which of the CST are you talking about? Preserving CST for one stage doesn't mean no information was lost in previous steps. C++ standard defines ~6 steps before the main C++ language parsing , many of which can be considered a separate formal languages with their own CST/AST.

1) Text decoding bytes -> text. While most text encodings are trivial byte->character substitution, variable length encodings like UTF8 can be described as (simple) context free grammars. I don't think any programing language toolchain does Unicode normalization at this stage, but in theory you could have a programming language which does that.

1.5) Trigraph substitution

2) text -> preprocessing tokens

3) preprocessing tokens -> preprocessing AST

4) as a result of executing preprocessing directive execution you obtain new sequence of tokens

4.5) string literal merging

5) main parsing

In practices some of these steps might be merged and not executed as separate stages, there are also a few more transformations I didn't mention.

Stuff like this makes real world source to source transformations messy. As the later grammars are operating on symbols which only exist in intermediate steps and don't always have simple 1:1 to mapping to the input file.

And in some cases you might have some custom algorithm doing a transformation which doesn't fit the model of context free grammars at all, thus whatever it did isn't part of any formal CST for the language (language in terms of formal grammars, not a programming language). Python is a good example of this. The indentation based scopes can't be handled by context free grammars, it relies on magic tokenizer which generate "indent", "dedent" tokens, so if you follow formal definitions CST of main python language doesn't contain exact information about original indentation. The fact that you can get it from LibCST is stretching definition of CST/changing the language it is parsing. At that point once you add all the extra information are you really building a CST or are you making an AST for a language where every character is significant, because you redefined which parts of program are important.

With all that said I wouldn't be surprised that the thing slack did was using something closer to AST (with some additional syntax details preserved) than CST (with additional analysis and done). If you are not building a general purpose tool for making small adjustments to arbitrary existing codebase (otherwise preserving original code), it's not necessary to preserve every tiny syntax detail as long as comments are preserved. I would expect them to be using a standardized code formatter anyway so loss of insignificant whitespace shouldn't be a major concern, and the diff will likely touch almost everyone line of code.

Whether "AST" or a "CST" is useless for specific task is in many situations less about "AST" vs "CST" but more about design choices of specific programming language, parser implementation and pushing things beyond the borders of strict formal definitions.