Hacker News new | ask | show | jobs
by taeric 846 days ago
My current view of what makes parsing so difficult is that people want to jump straight over a ton of intermediate things from parsing to execution. That is, we often know what we want to happen at the end. And we know what we are given. It is hoped that it is a trivially mechanical problem to go from one to the other.

But this ignores all sorts of other steps you can take. Targeting multiple execution environments is an obvious step. Optimization is another. Trivial local optimizations like shifts over multiplications by 2 and fusing operations to take advantage of the machine that is executing it. Less trivial full program optimizations that can propagate constants across source files.

And preemptive execution is a huge consideration, of course. Very little code runs in a way that can't be interrupted for some other code to run in the meantime. To the point that we don't even think of what this implies anymore. Despite accumulators being a very basic execution unit on most every computer. (Though, I think I'm thankful that reentrancy is the norm nowadays in functions.)

1 comments

What have those other things got to do with parsing though? Granted, they rely on parsing having already happened, but I don't see how there's much feedback from those considerations to the way that parsers work, or are written, or - as the article discussed - can be combined?
You can easily view it as having nothing to do with it. My push is that it is the point of parsing. You don't parse directly into understanding/execution. You parse into another representation, one that we never directly talk about, so that you can then move it into the next level.

Even English can be parsed first into the sounds. This is why puns work. Consider the joke, "why should you wear glasses to math class? It helps with division." That only works if you go to the sounds first. And you will have optionality in where to go from there.

So, for parsing programs, we often first decide on primitives for execution. For teaching, this is often basic math operations. But in reality, you have far more than the basic math operations. And, as I was saying, you can do more with the intermediate representation than you probably realize at the outset.

I agree, I think syntax could be defined as "the whole problem" if you see it holistically. Approaches that avoid scaling an intermediary grammar from lower-level tokens like the Lisp or Forth construction of "the parse directly maps to a data structure, and the data structure has the semantics" are robust. The reason why they aesthetically offend comes down to familiarity and tooling: infix expressions "look like math" - they please someone with prior training - but often act to hide important machine-level details. And the grammar helps to spread around the architecture of the language so that a syntax error compiles as a different semantic. Versus an approach like APL with a richer token set, or a ColorForth that packs in more semantic value by assigning tags to each word and making that part of the presentation.

I've moved towards designing languages now that operate over CSV source. That adds an extra dimension while still enabling convenient editing - just turn off all the parsing behavior in the spreadsheet and you can edit it as "plain text". Although, column alignment isn't always desirable in this case.