by Timon3 700 days ago

I don't know a lot about parser theory, and would love to learn more about ways to make parsing resilient in cases like this one. Simple cases like "ignore rest of line" make sense to me, but I'm unsure about "adversarial" examples (in the sense that they are meant to beat simple heuristics). Would you mind explaining how e.g. your `as` stripping could work for one specific adversarial example?

    function foo<T>() {
        return bar(
            null as unknown as T extends boolean
            ? true /* ): */
            : (T extends string
                ? "string"
                : false
            )
            )
    }

    function bar(value: any): void {}

Any solution I can come up with suffers from at least one of these issues:

- "ignore rest of line" will either fail or lead to incorrect results
- "find matching parenthesis" would have to parse comments inside types (probably doable, but could break with future TS additions)
- "try finding end of non-JS code" will inevitably trip up in some situations, and can get very expensive
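
To make the first failure mode concrete, here is a minimal sketch using a shortened variant of the snippet above. The helper `stripAsToEndOfLine` and the identity stand-in for `bar` are hypothetical and exist only for illustration. The sketch shows the "incorrect results" case: the stripped text still parses as JavaScript, but the argument is no longer `null`.

```typescript
// Hypothetical helper: drop everything from the first ` as ` to the end of each line.
function stripAsToEndOfLine(source: string): string {
  return source
    .split("\n")
    .map((line) => line.replace(/\s+as\s.*$/, ""))
    .join("\n");
}

// A shortened variant of the snippet above (function body only).
const snippet = [
  "return bar(",
  "    null as unknown as T extends boolean",
  "    ? true /* ): */",
  "    : false",
  ")",
].join("\n");

const stripped = stripAsToEndOfLine(snippet);
console.log(stripped);

// The stripped text is valid JavaScript, but it now passes
// `null ? true : false` to bar instead of `null`.
const identity = (value: unknown) => value;
const result = new Function("bar", stripped)(identity);
console.log(result); // false
```

With the full snippet the leftover `T extends string` would be a syntax error instead, which is the "fail" case.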

I'd love a rough outline or links/pointers, if you can find the time!

[0] TS Playground link: https://www.typescriptlang.org/play/?#code/AQ4MwVwOwYwFwJYHs...

WorldMaker 700 days ago

Most parsers don't actually work with "lines" as a unit; those are for user formatting. Generally the sort of building blocks you are looking for are more along the lines of "until end of expression" or "until end of statement". What defines an "expression" or a "statement" can be very complex depending on the parser and the language you are trying to parse.

In JS, because it is a fun example, "end of statement" is defined in large part by Automatic Semicolon Insertion (ASI), whether or not semicolons even exist in the source input. (Even if you use semicolons regularly in JS, JS will still insert its own semicolons. Semicolons don't protect you from ASI.) ASI is also a useful example because it is an ancient example of a language design intentionally trying to be resilient. Some older JS parsers would even ignore bad statements and continue at the next statement, based on the ASI-determined statement break. We generally like our JS to be much more strict than that today, but early JS was originally built to be a resilient language in some interesting ways.
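
A small illustration of the "semicolons don't protect you" point: a line break directly after `return` ends the statement, even when the next line carries an explicit semicolon. The function bodies are passed as strings to `new Function` purely so the line break survives any formatter; the variable names are made up for this sketch.

```typescript
// ASI ends the statement at the line break after `return`,
// so the `42;` on the next line is never reached.
const withLineBreak = new Function("return\n42;");
const sameLine = new Function("return 42;");

console.log(withLineBreak()); // undefined
console.log(sameLine());      // 42
```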

One place to dive into that directly (in the middle of a deeper context of JS parser theory): https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...

Timon3 700 days ago

Thanks for the response, but I'm aware of the basics. My question is pointed towards making language parsers resilient towards separately-evolving standards. How would you build a JS parser so that it correctly parses any new TS syntax, without changing behavior of valid code?

The example snippet I added is designed to violate the rules I could come up with. I'd specifically like to know: what are better rules to solve this specific case?

thanksgiving 699 days ago

> How would you build a JS parser so that it correctly parses any new TS syntax, without changing behavior of valid code?

I don't know anything about parsers besides what I learned from the one semester of an introductory class I took in college, but from what I understand of your question, I think the answer is that you can't, simply because we can't look into the future.

WorldMaker 699 days ago

In your specific case:

1. Automatic semicolon insertion would next want to kick in at the } token, so that's the obvious end of the statement. If you've asked it to ignore from `as` to the end of the statement (as you've established with your "ignore to the end of the 'line'"), that's where it stops ignoring.

1A. Obviously in that case `bar(null` is not a valid statement after ignoring from `as` to the end of the statement.

2. The trick to your specific case, which you've stumbled into, is that `as` is an expression modifier, not a statement modifier. The argument to a function is an expression, not a statement. That definitely complicates things because "end of the current expression" is often a lot more complicated than ASI (and people think ASI is complicated). Most parsers are going to have some sort of token state counter for nested parentheses. (This is a fun implementation detail of different parsers: while recursion is easy enough in "context-free grammars", the details of tracking that recursion are generally not technically "context-free" at that point, so sometimes it is in the tokenizer, sometimes it is a context extension to the parser itself, and sometimes it uses a stack implementation detail of the parser.) You are going to want to ignore everything up to the next "," token that signals a new argument or the next ")" that signals the end of the arguments, while respecting any () nesting.

2A. Because of how complicated expression parsing can get, that probably sets some resiliency bounds on your "ignorable grammar": it may require that internally it still follows most of the logic of your general expression language: balanced nested parentheses, no dangling commas, usual comment syntax, etc.

2B. You probably want to define those sorts of boundaries anyway. The easiest way is to say that ignorable extensions such as `as` must themselves parse as if they were valid expressions, even if the language cannot interpret their meaning. You can think of this as a meta-grammar where one option for an expression might be `<expression> ::= <expression> 'as' <expression>`, with the second expression being parseable but, after parsing, ignorable by the language runtime and JIT. You can see that effectively in the syntax description for Python's original PEP 3107 syntax-only type hints standard [1]; it's surprisingly succinct there. (The possible proposed grammar in the Type Annotations proposal to TC39 is a lot more specific and a lot less succinct [2], for a number of reasons.)

[1] https://peps.python.org/pep-3107/

[2] https://tc39.es/proposal-type-annotations/grammar.html
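
Points 2 and 2A can be sketched as a depth counter over an already-tokenized argument list. This is a rough sketch under stated assumptions, not how any real parser does it: `stripAsAssertions` is a hypothetical name, the token array is assumed to have had comments removed already, every `as` token is treated as the keyword, and angle brackets are not tracked, so `x as Map<string, number>` would be cut at the comma.

```typescript
const OPENERS = new Set(["(", "[", "{"]);
const CLOSERS = new Set([")", "]", "}"]);

// Drops each `as ...` run, ending at the first `,`, `;`, or closing bracket
// seen at nesting depth 0 relative to the `as`.
function stripAsAssertions(tokens: readonly string[]): string[] {
  const out: string[] = [];
  let skipping = false;
  let depth = 0;
  for (const token of tokens) {
    if (skipping) {
      const endsType =
        depth === 0 && (token === "," || token === ";" || CLOSERS.has(token));
      if (!endsType) {
        if (OPENERS.has(token)) depth++;
        else if (CLOSERS.has(token)) depth--;
        continue; // still inside the ignorable type
      }
      skipping = false; // fall through and emit the terminator
    }
    if (token === "as") {
      skipping = true;
      depth = 0;
      continue;
    }
    out.push(token);
  }
  return out;
}

// The adversarial call from the question, with the comment already removed.
const adversarial = [
  "bar", "(", "null", "as", "unknown", "as", "T", "extends", "boolean",
  "?", "true", ":", "(", "T", "extends", "string", "?", '"string"', ":",
  "false", ")", ")",
];
console.log(stripAsAssertions(adversarial).join(" ")); // bar ( null )
```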

bazoom42 699 days ago

CSS syntax has specific rules for how to handle unexpected tokens. E.g., if an unexpected character is encountered in a declaration, the parser ignores characters until the next ; or }. But CSS does not have arbitrary nesting, so this makes it easier.

Comments like the one in your example are typically stripped in the tokenization stage, so they would not affect parsing. The TypeScript type syntax has its own grammar, but it uses the same lexical syntax as regular JavaScript.
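
As a rough illustration of comment stripping at the tokenization stage, here is a deliberately small tokenizer. The name `tokenize` and the regular expression are assumptions made for this sketch; it ignores template literals, regex literals, and multi-character operators, all of which a real JavaScript lexer has to handle.

```typescript
// Alternatives, in order: block comment, line comment, double-quoted string,
// single-quoted string, identifier/keyword, integer, any other single character.
const TOKEN =
  /\/\*[\s\S]*?\*\/|\/\/[^\n]*|"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*'|[A-Za-z_$][\w$]*|\d+|\S/g;

function tokenize(source: string): string[] {
  const matches = source.match(TOKEN) ?? [];
  // Comments are recognized as tokens and then dropped, so a `)` inside
  // a comment never reaches the bracket-matching logic of the parser.
  return matches.filter((text) => !/^\/[*/]/.test(text));
}

console.log(tokenize("null as T ? true /* ): */ : false"));
// [ 'null', 'as', 'T', '?', 'true', ':', 'false' ]
```

Because strings are matched before single characters, a string literal that merely looks like a comment stays intact as one token.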

A “meta grammar” for type expressions could say skip until next comma or semicolon, and it could recognize parentheses and brackets as nesting and fully skip such blocks also.

The problem with the ‘satisfies’ keyword is that a parser without support would not even know this is part of the type language. New ‘skippable’ syntax would have to be introduced as ‘as satisfies’ or similar, triggering the type-syntax parsing mode.

Timon3 699 days ago

I understand that you can define a restricted grammar that will stay parseable, as the embedded language would have to adapt to those rules. But that doesn't solve the question, as TypeScript already has existing rules which overlap with JS syntax. The GP comment was:

> For example, the `as` keyword for casts has existed for a long time, and type stripping could strip everything after the `as` keyword with a minimal grammar.

My question is: what would a grammar like this look like in this specific case?

bazoom42 698 days ago

How about:

    TypeAssertion ::= Expression "as" TypeStuff
    TypeStuff ::= TypeStuffItem+
    TypeStuffItem ::= Block | any token except , ; ) } ]
    Block ::= ParenBlock | CurlyBracketsBlock | SquareBracketsBlock | AngleBracketsBlock
    ParenBlock ::= ( ParenBlockItem* )
    ParenBlockItem ::= Block | any token except ( )

etc.
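
One way to read that grammar is as a small recursive-descent skipper over a token array. The sketch below is an interpretation, not a tested type stripper: the function names are hypothetical, comments are assumed to be removed already, and punctuation is assumed to arrive as single-character tokens (so `>>` would need to be split by the tokenizer).

```typescript
const CLOSER_FOR = new Map<string, string>([
  ["(", ")"], ["{", "}"], ["[", "]"], ["<", ">"],
]);
const STOP = new Set([",", ";", ")", "}", "]"]);

// Block ::= opener BlockItem* closer. Returns the index just past the closer.
function skipBlock(tokens: readonly string[], pos: number, closer: string): number {
  pos++; // consume the opener
  while (pos < tokens.length && tokens[pos] !== closer) {
    pos = skipItem(tokens, pos);
  }
  if (pos >= tokens.length) throw new Error(`missing ${closer}`);
  return pos + 1; // consume the closer
}

// An item is either a nested Block or a single plain token.
function skipItem(tokens: readonly string[], pos: number): number {
  const closer = CLOSER_FOR.get(tokens[pos] ?? "");
  return closer === undefined ? pos + 1 : skipBlock(tokens, pos, closer);
}

// TypeStuff ::= TypeStuffItem+. `pos` is the first token after `as`;
// the result is the index of the first token that is not part of the type.
function skipTypeStuff(tokens: readonly string[], pos: number): number {
  const start = pos;
  while (pos < tokens.length && !STOP.has(tokens[pos] ?? "")) {
    pos = skipItem(tokens, pos);
  }
  if (pos === start) throw new Error("expected a type after `as`");
  return pos;
}

// The adversarial call from the question, with the comment already removed.
const callTokens = [
  "bar", "(", "null", "as", "unknown", "as", "T", "extends", "boolean",
  "?", "true", ":", "(", "T", "extends", "string", "?", '"string"', ":",
  "false", ")", ")",
];
const end = skipTypeStuff(callTokens, 4);
console.log(end, callTokens[end]); // 21 )
```

Inside a block, `,` and `;` are ordinary tokens, so `x as Map<string, number>, y` stops at the second comma rather than the first.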
