Hacker News new | ask | show | jobs
by dom0 3302 days ago
Most programming languages are context-sensitive [1] (at least with unbounded nesting), so parsing them correctly and efficiently is mathematically impossible. All practical implementations have to take shortcuts.

[1] Mainly due to begin..end blocks, curly braces or indentation (as in Python)

2 comments

Are most programming languages really context-sensitive? or aren't they mostly context-free?

My days of fiddling with writing parsers are long ago (https://www.codeproject.com/Articles/7035/A-Java-Language-ID...) but if I remember correctly most languages aim for at most a LL(2) grammar, meaning they are designed so the parser doesn't have to peek more than two tokens ahead before being able to make a correct determination.

C has some fun ones:

a * b;

Is either: a times b if a is a var, or declare a varible b with type a. If a is a typedef.

Also:

some_type b = {a, b, c, d};

Is only valid if some_type is an array or a struct. Which is possibly defined elsewhere in the source.

(I tend to see this syntax in some code bases (some_type) { a, b, c, d }, which is a bit better).

Context free grammars are perfectly capable of expressing matched curly braces, even with unbounded nesting. Am I missing something?
Yes, thinking about it that alone is not sufficient. Still, I'd claim that most languages are not context free.

CPP (Pre-processor) aside, C is not context free due to typedef making identifiers ambiguous. Also if-then-else? Since C++ templates are turing-complete, the grammar is probably unrestricted.

Python is not context free due to

    if ...:
        stmt1
        if ...:
           stmt2
        stmt3
stmt3 and stmt1 have to share the same level of indentation to form a valid Python program, but they might contain arbitrary indentation within brackets.