Hacker News new | ask | show | jobs
by olliej 1414 days ago
I think the alias analysis “pro” is overblown, it’s UB to take the address of a parameter via any mechanism other than explicitly taking the address - which a compiler obviously sees, and because it’s UB the compiler optimizes is free to assume no one is taking the address. Then for any parameters that are passed by reference the compiler has to assume there are other references so there’s no gain.

Honestly the only thing that really stood out as nice is the default pass by word-sized value, in the context of templates - it’s a thing that is achievable in C++, but requires a bunch of obnoxious additional templates that aren’t even part of the standard library so everyone ends up reimplementing the same cruft. Happily I believe there’s a proposal to add this exact functionality.

I also loathe their desire to listen to the BNF maximalists insistence on not having any “ambiguity” from <>s. I’m sorry it’s clearly parseable, and <>s are the standard token for decades. Switching to []s doesn’t make it less conceptually ambiguous, if anything it makes it more ambiguous to a human reader. The only people who don’t want <>s are PLT academics obsessed with forcing their dragon book idea of what a grammar should be. You can’t argue you’re doing it because the ambiguity in a grammar or lexer is bad, because then you would also drop infix operators.

Then in carbon the more reasonable adoption of pascal’s : notation for typing a variable or parameter removes the most common case of the supposedly terrible ambiguity anyway.

4 comments

> I also loathe their desire to listen to the BNF maximalists insistence on not having any “ambiguity” from

Odd that you put ambiguity in quotes. I guess it's not real ambiguity then.

Ivory tower academics, as you see them, actually care about stuff because it affects you. I've heard too many times the knee-jerk response of "it's just theory" as if that actually meant anything. Try implementing it yourself and you'd start to understand that little bits of crap typically don't add, they multiply (Edit: and I'm hurting from exactly this right now), so don't diss theory until you've (ironically) had some practice with it.

> Odd that you put ambiguity in quotes. I guess it's not real ambiguity then.

Because a parser is never really ambiguous is it? Nobody writes a parser that has a random decision on which rule to take.

In parser theory there is such a thing as an ambiguous grammar. Parser generators detect such a grammar and never produce an actual parser for it. So in a sense most parser that you actually use don't have ambiguous grammars by definition. That said, you could have a manually written parser that thinks it implements an "ambiguous grammar", but in reality chose an deterministic (even if possibly arbitrary l) resolution of the ambiguity.

It's still ok talk about the grammar as being ambiguous no matter what a purported implementation of such grammar is deterministic (because it ultimately does not implement the actual grammar)

I guess I don't understand why we use languages for writing grammars that let you express ambiguity in the first place.

I think parser generators were a mistake.

for someone who is the founder of TruffleRuby (I've checked, and you are <https://docs.oracle.com/en/graalvm/enterprise/21/docs/refere...>), you sure have made some very odd comments and I don't know what to make of them.
> you sure have made some very odd comments and I don't know what to make of them

Well they’re from practical experience aren’t they.

I agree it's a pain. Could you share some alternatives?
Imperative parsing - PEGs or just plain old recursive descent.
I put ambiguity in quotes, because the only people who claim it makes things ambiguous are theoreticians - who you appear to have decided I think of as being "ivory tower" which we'll address in a bit - who favor a theoretical definition of what a grammar should be, and disregard (1) how people understand language, programming or otherwise, (2) insist on a mathematical definition of grammars, and grammar classes, that doesn't reflect reality. Insisting on that mathematical definition of what a grammar is, is the thing that creates the ambiguity that is so despised here, and (3) want to prioritize the ease of parsing for a dumb parser over the reality in which a parser has actual contextual information.

It's also well understood that for (2) this theoretical ambiguity is not actually a real problem for humans - the things for which reading is actually important - because otherwise they would not allow infix operators, which are ambiguous according to their definition, but much like languages that they have chosen to define as ambiguous, infix is not.

In that theoretical model of grammars we define the power of different classes of parsers to go something like LL < LALR < LR < GLR. This is how the mathematical theory of grammars goes, and everything beyond LL is basically required to be based on some PDA. The ambiguities in the PDA's state transitions are what "prove" a grammar to be ambiguous. Further LL parsers are considered the least powerful, and incapable of parsing the majority of grammars, especially those that belong to "real" languages.

The problem is that this model of grammars is arbitrarily restricted, and many (I would guess most real world) parsers for "real" languages are not any of these classes, which is certainly odd given the completeness indicated by theory. That's because this restricted definition of what a parser is says that most real world programming languages require an LALR parser, and in reality the big ones likely need LR or GLR.

In the real world, what is used could probably be classed as LL(infinity), which is impossible according to the theoretical parser classes in use. The reason that we use these impossible LL(infinity) parsers, is because they are human writable, and human understandable, and also vastly faster than the state machines LALR parsers (think bison and yacc IIRC), even more so for LR, and slower again for GLR (even if you can get away with a tomita parser, if you could find a generator for one). Additionally all the non LL parsers can only be reasonably implemented via state machine, and beyond trivial examples, a human is not going to be able to write the state transition tables, let alone modify it in future.

That means using a tool to generate the parser for you, and this really is the heart of the theoretical definition being used, to be a "real" parser, you need to be able to make a tool that will read a BNF grammar, and have it produce a program that will be able to take some text, and then, accumulating no state beyond a stack and its current state, produce an AST.

I have never been shown any justification for why this makes a grammar more understandable, it is simply stated as fact that being easier for a machine to parse means being easier for a human. The contra to that is that most people seem find infix much easier to understand than the non ambiguous pre- and post-fix notations. You could argue that's familiarity, but I would say "so what?" that people can find the "ambiguous" grammar easier to understand than the non ambiguous one indicates that the claim of "easier for machine == easier for people" to not stand up.

Anyway that giant wall of text is my position, and that is why I don't accept "<> for generics is ambiguous" as the sole justification for a change from the root language this is meant to be a replacement for. If they were creating their own language then sure, they could go ham and make the least ambiguous language that they liked (but lisp I think has technically already done that), but that is not what Carbon is doing.

Finally (and so the Great Wall of text continues) on your "ivory tower academic" accusation, I disagree. Academic CS is hugely valuable, and having been in CS academia for many years I feel I am allowed to express my disagreement with some of it.

If credential waving is needed, for academics: my masters is in programming language theory, and I was the TA for among other things my university's courses on programming language theory, parsing, etc. In industry I've worked on clang for many years, and for parsing specifically I wrote the javascript parser in javascriptcore.

I understand parsing, and I understand the PLT subfield of CS.

My frustration with academic PLT is more that by confining the definitions of what is a valid grammar, or what is a real parser, we don't see any real research into into those aspects of PLT and in my experience it seems like an area where we just treat the dragon book as being the definition of everything that matters. Then we turn around say anything that doesn't fit neatly in that definition must be bad, and can be ignored.

>because otherwise they would not allow infix operators, which are ambiguous according to their definition

This is not true, infix operators can be easily encoded in BNF as a series of rules, each rule recognizing expressions unique to it as well as all the expressions of higher precedence. The following for example is a grammar that parses the four arithmetic operations with the usual precedence.

Expr ::= Addition

Addition ::= Term ((PLUS | MINUS) Term)*

Term ::= Prim ((MUL | DIV) Prim)*

Prim ::= NUM | OPEN_PAREN Expr CLOSE_PAREN

>The problem is that this model of grammars is arbitrarily restricted

It is restricted, but not arbitarily so. There is a very natural constraint that explains why Parsing Theory doesn't like ad-hoc parsers in the way you hate so much : the parser shouldn't depend on the "content" of the stream. The easiest way to see what I mean by this is to demonstrate how one violation of it works : C++'s ambiguous "X Y();" is either a function prototype or a variable declaration, depending on whether Y is a valid type name. This is extremely ugly to me, a parser shouldn't accumulate type information while it's parsing, even if it's as trivial as a set of type names. This is an entirely different concern that should be left to an entirely different module of the compiler. C++'s syntax is full to the brim of such ugly corners, the standard doesn't even bother to specify a binding BNF on the implementations!

Compiler Theory settled on an interface design where the parser must be able to output a concrete syntax tree from a stream without relying on global ad-hoc state like a table of special names or whatever. You framing it as "accumulating no state beyond a stack and its current state" is being unfair, it would be like framing the point of local variables as just a limiting thing that we invented arbitarily : Yes, that's the entire point, No, it's not arbitary, it's intentionally limiting, because the resulting design is easier to understand and simply more elegant.

>lisp I think has technically already done that

There is a middle ground between lisp's austere and misguided "F*k You, Why Should Parsing Be Hard Just To Please Your Eyes?" syntactical philosophy, and between C++'s and Perl's "Lol, Grammars Are For Weaklings". Complex syntax is good, it automates the tedious tree-layout that is the text of lisp programs, you don't need to repetitively state the precedence and hierarchy of code elements, the parser infers some of them for you. However, as in all "The Machine Infers Things" situations, we have to be careful with the rules of inference. We don't want the machine to infer counter-intuitive things, or fail to infer entirely. CFGs formalism is an elegant tradeoff.

>we don't see any real research into into those aspects of PLT

It might be the other way around, that allowing arbitary ad-hoc state in the parser results in too ugly a formalism or no interesting formalism at all, therefore people settled on disallowing it. It's like how theoretical mechanics famously hates friction, it isn't arbitary or born of prejudice, it's just that friction is too ugly and hard to yield an elegant and concise theory, therefore people just pretend that it doesn't exist.

On the surface, it doesn't seem hard to try and formalize some of the real-world parser hacks into the grammar formalism. For example, C++'s "this should be parsed as X unless the name is in a certain set then it should be parsed as Y" doesn't appear too wild a constraint type to encode in a formal grammar, my suspicion is that if you try to do it you will either discover that (a) it doesn't actually change any of the implication of traditional theory, and therefore it is just a "sugar" over traditional constraints : cool, but it technically doesn't tell us anything new (b) it significantly changes the implications and guarantees of theory, to the point that it's not an elegant or beautiful theory anymore. You can introduce control flow into prolog, but why bother ? it wouldn't be prolog anymore.

I don't want to imply that Academia is all-seeing or all-wise, but it is a very powerful novelty-rewarding machine, if there is Anything theoretically interesting to be gained from formalizing parsing hacks, I find it very hard to believe it hasn't been studied in 65+ years of grinding on parser theory and formal grammars.

>I would guess most real world parsers for "real" languages are not any of these classes

I find this too strong a claim. I would bet money it's false for anything but C, C++, Perl, Ruby, and let's say PHP just to be safe. Mention any language that isn't on this blacklist, and I bet I can find BNF-like description of the grammar that tells you perfectly well how to obtain a CST from text, which means they are CFG.

> which a compiler obviously sees

Not if it's in a different compilation unit, it doesn't. Or if it's just not inlining the function for some reason, then it probably also doesn't. Which is why Carbon's restriction here is so useful and practical.

Ok, I'm willing to accept be wrong here. My interpretation of what I read is that Carbon is disallowing taking the address of a parameter. Very specifically:

    void f(int i) { &i; /* disallowed */ }
I think that's a trivial case and is hopefully obviously easy to detect, and I know we're considering relevance to escaping which this doesn't do, but it seems that taking the address is explicitly disallowed regardless of use. Anyway given the many references to references, lets do

   void g(int &i) { &i; /* disallowed once more */ }
To me this seems no harder to detect, but this does seem like a case where banning taking the address /could/ impact things. But I can't see what the win is.

I'm assuming its a different TU or something else that prevents inter procedural optimization. Could you give some examples where this allows performance wins?

That is what Carbon prevents, yes. And, inside a particular function that does it, it's obviously easy to detect. But in the calling function, it's not easy to detect.

If in boo.cpp you call your g() function which is in foo.cpp, how does the compiler know if it's taking the address or not? It's in a different compilation unit. All the compiler knows at that point is there's a function 'g' that takes an int by reference. It has no idea what 'g' does with it, it doesn't even know where 'g' is - that won't be known until link time. So the compiler is forced to be conservative and allow for 'g' to do anything that C++ allows - which includes stashing the reference somewhere or const_cast'ing away a const& into a mutable reference.

This is where Carbon's win comes in. Since that behavior just isn't allowed, the compiler doesn't have to be conservative. It doesn't need to somehow see into different compilation units to perform helpful optimizations (critically for const unique_ptr& or const shared_ptr&)

> it’s UB to take the address of a parameter via any mechanism other than explicitly taking the address - which a compiler obviously sees, and because it’s UB the compiler optimizes is free to assume no one is taking the address.

I don't get that: can you express in C++ a code that "take the address of a parameter via any mechanism other than explicitly taking the address"?

Consider this C code (also "works" if compiled as C++):

    int main(void) {
        int x = 0;
        int arr[1];
        int *p = arr + 1;
        *p = 42;
        return x;
    }
On a lot of systems (e.g., https://godbolt.org/z/jYqM8TT3Y), it just so happens that `x` is right above `arr` on the stack, so that code will return 42. But that code is absolutely UB.

The more general name for this concept is "pointer provenance". Basically, you can't pull pointer values out of thin air; you have to derive them from operations rooted at taking the address of something within the same allocation.

That's a buffer overflow. The optimizer doesn't need to reason about changing the behavior of such things.
That’s the point - it’s UB to go off the end of an array, or more generally to dereference a pointer outside of the bounds of the target object (yes, buffer+buffer_length is a valid pointer for the purpose of comparisons, but dereferencing it would be UB).

However in practice you can do this, and walk the stack to find the parameters or what have you, and then you’ve got a pointer to a parameter without the compiler being aware of it. But this is all explicitly UB, so it’s ok for the compiler to be unaware of it, and it’s free to do whatever codegen it wants given the assumption that UB can never happen.

The point is that on systems where that code returns 42, `p` has the exact same value it would if I did `int *p = &x;` instead, but not the same provenance.
And because C++ says "one past the end" pointers are a thing, both these pointers can exist.

As written p is a one-past-the-end pointer into object arr, but the address one past the end of arr may well be the address of x. If pointers are just addresses, these pointers are the same... right?

Neither C nor C++ currently actually explain how this works for their "abstract machine" in the standards documents. The reality is that your C++ compilers (and any non-toy C compilers) have pointer provenance because it's a nightmare to optimise C programs without, but since it isn't documented anywhere (my understanding is that C23 might fix this for C by taking a TS and an equivalent fix via P2318 could land in C++ 26) it's difficult to say if you ever find bugs in their behaviour.

> And because C++ says "one past the end" pointers are a thing, both these pointers can exist.

While one-past-the-end pointers are allowed to exist, they are not allowed to be dereferenced.

> these pointers are the same... right

The entire point of provenance is that even though their numerical values are the same, they are not the same.

> Neither C nor C++ currently actually explain how this works for their "abstract machine" in the standards documents.

While it isn't mentioned explicitly, it can be inferred from other things that the standard does say. The compiler authors didn't just make it up.

Imagine the parameter is passed on the stack, I can take the address of a local variable (which in general forces it to be on the stack), I can then walk up the stack from that address to where the parameter is.

This is undefined behavior. Because it is UB that compiler is allowed to assume it cannot happen. Therefore I have the address of a parameter, and can pass that to a closure or whatever, and the parameter has escaped, but the compiler doesn't know.

More importantly by definition the compiler does not need to know, because a program that does that is no longer well defined.

Why not just use parentheses, because, you know, templates are just AOT-functions behind the scenes! Try to come up with an example where this syntax is worse than (angle or square) brackets...