Hacker News new | ask | show | jobs
by melloclello 3302 days ago
> It’s no secret that I really like Clojure and as a lisp, it was the easiest language for me to start the prototype with, but there’s no reason this couldn’t be done for any language with a dynamic runtime. The rest is mostly simple analysis of an AST and some clever inference.

I have looked into this. It is kind of criminal that for most real world languages (Ruby[1], C[2] etc), it's not possible to just define a grammar and throw it at a standard parser generator for them - they generally have one or two quirks which make this infeasible.

In my ideal alternate universe it would be considered unthinkable to publish a language without also publishing a grammar in a standard format for said language, which can then be plugged into your favourite text/semantic/tree editor. Our tools should dictate our languages, not the other way around.

[1] http://programmingisterrible.com/post/42432568185/how-to-par...

[2] https://en.wikipedia.org/wiki/The_lexer_hack

1 comments

Most programming languages are context-sensitive [1] (at least with unbounded nesting), so parsing them correctly and efficiently is mathematically impossible. All practical implementations have to take shortcuts.

[1] Mainly due to begin..end blocks, curly braces or indentation (as in Python)

Are most programming languages really context-sensitive? or aren't they mostly context-free?

My days of fiddling with writing parsers are long ago (https://www.codeproject.com/Articles/7035/A-Java-Language-ID...) but if I remember correctly most languages aim for at most a LL(2) grammar, meaning they are designed so the parser doesn't have to peek more than two tokens ahead before being able to make a correct determination.

C has some fun ones:

a * b;

Is either: a times b if a is a var, or declare a varible b with type a. If a is a typedef.

Also:

some_type b = {a, b, c, d};

Is only valid if some_type is an array or a struct. Which is possibly defined elsewhere in the source.

(I tend to see this syntax in some code bases (some_type) { a, b, c, d }, which is a bit better).

Context free grammars are perfectly capable of expressing matched curly braces, even with unbounded nesting. Am I missing something?
Yes, thinking about it that alone is not sufficient. Still, I'd claim that most languages are not context free.

CPP (Pre-processor) aside, C is not context free due to typedef making identifiers ambiguous. Also if-then-else? Since C++ templates are turing-complete, the grammar is probably unrestricted.

Python is not context free due to

    if ...:
        stmt1
        if ...:
           stmt2
        stmt3
stmt3 and stmt1 have to share the same level of indentation to form a valid Python program, but they might contain arbitrary indentation within brackets.