by PumpkinSpice 975 days ago
I'm surprised that you draw a sharp distinction between C and Pascal syntax. They have a shared lineage and are really very close to each other. Yeah, curly braces won over "begin" and "end", but that's not a matter of some huge conceptual rift, just convenience.

There are numerous languages today, including Haskell and OCaml, that are far more removed from the Algol lineage than these two. Heck, the differences between Rust and C are probably more pronounced than between C and Pascal.


One huge difference between C and Pascal grammars is that Pascal is LR(1), so it can be parsed easily, which helps one-pass translation. It also helps humans read it.

C, on the other hand, has needlessly complicated syntax; a function definition is hard to detect, and a pointer to a function is hard to interpret, because it's literally convoluted: https://c-faq.com/decl/spiral.anderson.html

Sadly, this is a general stylistic difference: where Pascal tries to go for clarity, C makes do with cleverness, which is more error-prone.

You misremember. Pascal's grammar is "easy" because it is LL(1), not LR(1)!

C is almost LR(1), if we allow prior declarations to decide how some tokens are classified, like whether an identifier is a variable or type name.
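To make the "prior declarations decide" point concrete, here is a small sketch (the function names are made up for illustration) in which the tokens `T * p` are a declaration in one scope and a multiplication in another:

```c
typedef int T;

/* Here T is a typedef name, so "T * p;" declares p as a pointer to int. */
int star_as_declarator(void) {
    int x = 7;
    T * p;
    p = &x;
    return *p;
}

/* Here a local variable named T shadows the typedef, so "T * p" is a product. */
int star_as_multiplication(void) {
    int T = 6, p = 7;
    return T * p;
}
```

A parser cannot tell the two apart from the tokens alone; it has to know what `T` currently names.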

Declarations like

  void (*signal(int, void (*fp)(int)))(int);
are LR(1).
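For readers who find that declaration hard to unpick, a sketch with a typedef may help. `handler_t`, `my_signal` and `record` are hypothetical names used only for illustration; `my_signal` is a toy stand-in, not the real `signal` from `<signal.h>`.

```c
#include <stddef.h>

/* Hypothetical alias: pointer to a function taking int and returning void. */
typedef void (*handler_t)(int);

/* These two declarations have the same type, so a compiler accepts both
   in one translation unit. */
void (*my_signal(int sig, void (*fp)(int)))(int);
handler_t my_signal(int sig, handler_t fp);

static handler_t current_handler = NULL;

/* Toy implementation: install fp and return the previously installed handler. */
handler_t my_signal(int sig, handler_t fp) {
    (void)sig;
    handler_t old = current_handler;
    current_handler = fp;
    return old;
}

/* A do-nothing handler to pass around. */
void record(int sig) { (void)sig; }
```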

LR(1) sentences are harder to read than LL(1) because you have to keep track of a long prefix of the input, looking for right reductions (if you follow certain LR algorithms). LR parsing algorithms use a stack which essentially provides unlimited lookahead, in comparison to LL(1). Both LL(1) and LR(1) have one symbol of lookahead, but qualitatively it's entirely different because the lookahead in LR is happening after an indefinitely long prefix of the sentence which has not been fully analyzed, and has been shunted into a stack, to be processed later. Many symbols can be pushed onto the stack before a decision is made to recognize a rule and reduce by it. Those pushed symbols represent a prefix of the input that is not yet reduced, while the reduction is happening on the right of that. So it is backwards in a sense; following what is going on in the grammar is a bit like understanding a stack language like Forth or PostScript.

An LL(1) grammar allows sentences to be parsed in a left to right scan without pushing anything into a stack to reduce later. Everything is decidable based on looking at the next symbol. Under LL(1), by looking at one symbol, you know what you are parsing; each subsequent symbol narrows it down to something more specific. Importantly, the syntax of symbols that have been processed already (material to the left) is settled; their syntax is not left undecided while we recognize some fragment on the right.
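As a rough illustration of "one symbol tells you what you are parsing", here is a minimal predictive parser for a toy type grammar. The grammar and all names are assumptions made up for this sketch; it is loosely Pascal-flavoured, not real Pascal.

```c
#include <stdio.h>
#include <string.h>

/* Toy grammar (an assumption for illustration):
 *   type ::= "^" type | "array" "of" type | NAME
 * Each alternative starts with a different token, so one token of
 * lookahead always picks the rule. Tokens are separated by spaces. */

/* Copy the next space-delimited token into tok; returns 0 at end of input. */
static int next_token(const char **src, char *tok, size_t cap) {
    size_t n = 0;
    while (**src == ' ') (*src)++;
    while (**src != '\0' && **src != ' ' && n + 1 < cap) tok[n++] = *(*src)++;
    tok[n] = '\0';
    return n > 0;
}

/* Append text to out, truncating if it does not fit. */
static void append(char *out, size_t cap, const char *text) {
    size_t used = strlen(out);
    snprintf(out + used, cap - used, "%s", text);
}

/* Parse one type and write an English reading into out. Returns 1 on success. */
static int parse_type(const char **src, char *out, size_t cap) {
    char tok[32];
    if (!next_token(src, tok, sizeof tok)) return 0;
    if (strcmp(tok, "^") == 0) {            /* rule 1 chosen by the first token */
        append(out, cap, "pointer to ");
        return parse_type(src, out, cap);
    }
    if (strcmp(tok, "array") == 0) {        /* rule 2 chosen by the first token */
        if (!next_token(src, tok, sizeof tok) || strcmp(tok, "of") != 0) return 0;
        append(out, cap, "array of ");
        return parse_type(src, out, cap);
    }
    append(out, cap, tok);                  /* rule 3: anything else is a NAME */
    return 1;
}

/* Returns the English reading, or NULL if the input is not a single type. */
const char *describe(const char *src) {
    static char buf[128];
    char extra[32];
    buf[0] = '\0';
    if (!parse_type(&src, buf, sizeof buf)) return NULL;
    return next_token(&src, extra, sizeof extra) ? NULL : buf;
}
```

Everything already emitted is final by the time the next token is read; nothing to the left is waiting on a later reduction.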

Under LR(1) it's possible for a long sequence of symbols to belong to entirely unrelated phrase structures, only to be decided when something finally appears on the right. An LALR(1) parser generator outputs a machine in which the states end up shared by unrelated rules. The state transitions then effectively track multiple parallel contexts.

> C is almost LR(1),

Does that include the C preprocessor?

Somehow, I recall someone here (maybe it was user walterbright) suggesting that implementing a C preprocessor was a lot of work - maybe months - so one might consider using Facebook's MIT licensed preprocessor:

https://github.com/facebookresearch/CParser

The C preprocessor is a purely functional language![1]

[1] https://web.archive.org/web/20230714010215/http://conal.net/...

This is correct, thanks!
I'm very much not a C programmer, but I've never understood why it seems far more common to write `float *foo` instead of `float* foo`. The "pointerness" is part of the type and to me the latter expresses that far more clearly.
Because the syntax is:

  <specifiers> <declarator> {, <declarator>, ...} ;
The star is a type-deriving operator that is part of the <declarator>, not part of the <specifiers>!

This declares two pointers to char:

  char *foo, *bar;
This declares foo as a pointer to char, and bar as a char:

  char* foo, bar;
We have created a trompe l'oeil by separating the * from the declarator to which it belongs and attaching it to the specifier to which it doesn't.
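The compiler can confirm this. A small sketch using C11 `_Generic` (the macro and function names are invented for illustration):

```c
/* Evaluates to 1 if x has type "pointer to char", otherwise 0. */
#define IS_CHAR_PTR(x) _Generic((x), char *: 1, default: 0)

/* With the star written next to the specifier, only foo is a pointer. */
int second_is_pointer_misleading(void) {
    char* foo = 0, bar = 'x';
    (void)foo;
    return IS_CHAR_PTR(bar);
}

/* With a star on each declarator, both names are pointers. */
int second_is_pointer_explicit(void) {
    char *foo = 0, *bar = 0;
    (void)foo;
    return IS_CHAR_PTR(bar);
}
```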
On the other hand, for those of us who agree with the GP on this, one way around the pitfall is to have your project's style guide ban multiple declarations on a single line, or at least ban them for non-trivial variables, so `int x, y, z;` is permitted, but nothing more than that.
That's fine if it isn't used as a pretext for writing nonsense like char* p; which should likewise be banned in the same coding style document.
>This declares foo as a pointer to char, and bar as a char:

  char* foo, bar;
So that's why I've had so many problems understanding C. I come from the Pascal world, where a type specification is straightforward.
Because it isn't: in `float* foo, bar;`, foo is a pointer and bar is not.

(There were suggestions back in the 90s that, to make C easier to parse for humans (and, not coincidentally, to simplify the compiler grammar), this should be `foo, bar: float*;`, and your model of pointerness could actually be true. That never got much more traction than some "huh, that would be better, too bad we've been using this for 10 years already and will never change it" comments :-), with an occasional side of "maybe use typedefs instead".)

Good news (kinda): C23 allows (and GCC has for a long long time allowed) you to write typeof(float *) foo, bar; and declare two pointers. Not that I’d advocate writing normal declarations that way, but at least now you can write macros (e.g. for allocation) that don’t choke on arbitrary type names.
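A sketch of that, spelled `__typeof__` so it also builds in pre-C23 modes of GCC and Clang (C23 spells it `typeof`). `IS_FLOAT_PTR`, `DECLARE_PAIR` and the function names are hypothetical, invented for this example.

```c
/* Evaluates to 1 if x has type "pointer to float", otherwise 0. */
#define IS_FLOAT_PTR(x) _Generic((x), float *: 1, default: 0)

/* Hypothetical macro: declares two variables of the same arbitrary type T. */
#define DECLARE_PAIR(T, a, b) __typeof__(T) a, b

/* Plain declaration: the star belongs to foo only, so bar is a float. */
int plain_second_is_pointer(void) {
    float value = 1.5f;
    float* foo = &value, bar = value;
    (void)foo;
    return IS_FLOAT_PTR(bar);
}

/* typeof turns "float *" into a single specifier, so both are pointers. */
int typeof_second_is_pointer(void) {
    float value = 1.5f;
    __typeof__(float *) foo = &value, bar = &value;
    (void)foo;
    return IS_FLOAT_PTR(bar);
}

/* The same trick inside a macro that accepts any type name. */
int macro_second_is_pointer(void) {
    float value = 1.5f;
    DECLARE_PAIR(float *, foo, bar);
    foo = bar = &value;
    (void)foo;
    return IS_FLOAT_PTR(bar);
}
```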
Which is why the convention is usually to not permit multiple declarations in one line.

If you value your codebase anyway.

Declaring X,Y and Z on separate lines for a graphics routine would just be silly, they're all the same type.

Defensive programming that extreme reminds me of the behavior I learned to avoid pissing off my drunk dad.

It's either that, or making a pointer type to bind things correctly. Convention is there to paper over shitty, shitty behavior in the language that is easy to trip yourself up on.
I think it is a matter of preference, both "float* foo" and "float *foo" are widely used.

Personally i used "float *foo" for years until at some point i found "float* foo" more natural (as the pointer is conceptually part of the type) so i switched to that, which i've also been using for years. I've worked on a bunch of codebases which used both though (both in C and C++) - in some cases even mixed because that's what you get with a codebase where a ton of programmers worked over many years :-P.

I do tend to put pointer variable declarations on their own lines though regardless of asterisk placement.

(and of course there is always "float foo[42]" to annoy you with the whole "part of the type" aspect :-P)

Here's how I understand it.

One important (and beautiful) thing to understand about C is that declarations and use in C mirror each other.

Consider the same type written in Go and C: array of ten pointers to functions from int to int.

Go: var funcs [10]*func(int) int

C: int (*funcs[10])(int)

Go's version reads left to right, clearly. The C version is ugly.

But the beautiful thing about the C version is that it mirrors how funcs can be used:

(*funcs[0])(5)

See how it's just like the declaration.

Go's version doesn't have this property.
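A compilable sketch of the C side, with two made-up functions (`twice`, `square`) filling the first slots of the array:

```c
static int twice(int x)  { return 2 * x; }
static int square(int x) { return x * x; }

/* Array of ten pointers to functions from int to int, as declared above.
   The remaining eight slots are null pointers. */
int (*funcs[10])(int) = { twice, square };

/* The call expression has the same shape as the declaration. */
int call_first(int x)  { return (*funcs[0])(x); }

/* The explicit * is optional when calling through a function pointer. */
int call_second(int x) { return funcs[1](x); }
```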

So, now about the *.

Usage of * doesn't require spaces.

If p is a pointer to int, you use it like this: *p

And not like this: * p

And since type declarations follow usage, "int *p" makes more sense.

There is also a good argument about "int *p, i". In the end, these usages follow from how the C grammar works.

There are many more musings about that on the web, but here is one of my favourites: https://go.dev/blog/declaration-syntax.

Edit: HN formatting.

The * binds the name, not the type.

https://godbolt.org/z/GsoxrWdrG

For the same reason you don’t write x+y * z: because then the spacing contradicts the way the priorities work in the language.

We might wish for the C declaration syntax to be <type> <name>[, ...], but it’s not: it’s <specifier>[ ...] <declarator>[, ...], where int, long, unsigned, struct stat, union { uint64_t u; double d; }, and even typedef are all specifiers, and foo, (foo), (((foo))), *bar, baz[10], (*spam)(int), and even (*eggs)[STRIDE] are all declarators (the wisdom of using the last one is debatable, but it is genuinely useful if you can count on the future maintainer to know what it means).
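Two of those declarator forms in action, as a sketch. The `grid` data and the function names are made up; `STRIDE` is given an arbitrary value of 4 here.

```c
#define STRIDE 4

static int grid[3][STRIDE] = {
    { 1,  2,  3,  4 },
    { 5,  6,  7,  8 },
    { 9, 10, 11, 12 },
};

/* (*eggs)[STRIDE] declares a pointer to an array of STRIDE ints,
   so eggs[row] is a whole row and eggs[row][col] is one element. */
int row_sum(int row) {
    int (*eggs)[STRIDE] = grid;   /* grid decays to a pointer to its first row */
    int total = 0;
    for (int col = 0; col < STRIDE; col++)
        total += eggs[row][col];
    return total;
}

/* Redundant parentheses around a name are still valid declarators. */
int paren_sum(void) {
    int foo = 1, (bar) = 2, (((baz))) = 3;
    return foo + bar + baz;
}
```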

Everybody is free to not like the syntax, but actively misleading the reader about its workings seems counterproductive.

I swear if I see one more "int* x, y" example I will flip a keyboard. This isn't the 90s anymore. Everyone and their mother knows about this one pitfall, repeated for decades by people who have yet to read the memo that we now declare one variable per line, so it is never an issue. Even if you make this mistake, it's the least of your worries when coding in C because the compiler will typically warn you when you try to use the integer as a pointer.

Get with the program: types on the left, names on the right, one declaration per line.

Humans can normally treat C as if it is LR and get away with it, except for a few places that they can often recognize and avoid. It is a bad shortcut, but still one that you can get by with taking in a lot of code.
Everywhere but typedef can be made LR(1) IIRC
Sure, but "get away with" is not something to strive for in a programming language. I've always hated C for the unnecessary brainpower it sometimes takes to parse a construct.
Historically, the point of writing 'begin' and 'end' instead of using curly braces was mostly support for non-ASCII character sets where the curly braces are not included. It's why C also has an alternate syntax using <% and %> and COBOL goes as far as writing out arithmetical operators as English text, such as DIVIDE x INTO y GIVING z.
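For the curious, the C digraphs look like this. The sketch writes the same function twice, once with the usual punctuation and once with `<%`, `%>`, `<:` and `:>`; the function names are invented for the example.

```c
/* The usual spelling. */
int sum_with_braces(void) {
    int values[3] = { 10, 20, 12 };
    int total = 0;
    for (int i = 0; i < 3; i++) {
        total += values[i];
    }
    return total;
}

/* The same function using digraphs: <% %> for braces, <: :> for brackets. */
int sum_with_digraphs(void) <%
    int values<:3:> = <% 10, 20, 12 %>;
    int total = 0;
    for (int i = 0; i < 3; i++) <%
        total += values<:i:>;
    %>
    return total;
%>
```

The compiler treats both spellings as the same tokens.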
Ada, at least, uses begin...end in part because it prevents certain kinds of errors. In its syntax you have to specify what you are ending, reducing the risk of invalid matches and increasing the likelihood of the error report system guessing correctly what you intended. E.g.:

    if X > 0 then
      Y := 0;
    end if;
Curly braces are shorter, but a close curly brace will match any open curly brace. Such is the nature of trade-offs.
Ngl, I think that's brilliant. Braces matching the wrong brace is like a daily occurrence. It's such a tedious small thing that constantly hounds me whenever I'm writing code
In languages without this feature (most of them), you sometimes see long blocks get labeled at the end anyway. On the other hand, you could argue that if your block is long enough that it doesn't fit on the screen, then it should be its own function anyway.

Like this, except replace "..." with many lines of code.

  if (z.p == z.p.p.left) {
      ...
  } else { // z.p != z.p.p.left
      ...
  } // if
> In languages without this feature (most of them), you sometimes see long blocks get labeled at the end anyway.

True. However, in Ada at least, if the block types don't match then it's a syntax error detected at compile time by the compiler. Comments like those listed above are often not checked at compile time, and thus aren't very useful for preventing errors.

Most programming languages now support reformatters out of the box. Part of the point of those is to make mismatched closing braces more visible.
Rainbow brackets, so the brackets themselves are color coded to match with their corresponding partner, are also a godsend.
I'm surprised that code folding has never really become good/popular enough to make all these things non-issues.

Often it's only for named blocks like functions, and not for the really unhelpful bits like that conditional branch that is long, simple and deep, but really does not deserve spamming a namespace with an unhelpful identifier. And I've yet to see a deliberately short-lived folding to lessen the out of sight, out of mind tax that code folding is associated with. If there was deliberately short-lived folding, perhaps auto-reexpanding whenever the section has scrolled out of view, I'd use folding all the time, to navigate the nesting. The quasi-permanent, until-explicitly-re-expanded code folding? Yeah, I hardly ever use it; too many bad experiences with forgetting to re-expand.

Even with all these helpers, there's too much cognitive overhead and not enough that IDEs or plugins can do to take that away. Rainbow braces are nice and all, but it's not enough when the underlying concept is broken.
Scala 3 actually has a brace-less mode that supports this.
You can do this with PHP, too, but it’s on the rare side except in some templating niches.

The reverse-reserved-word convention like ‘fi’ to end an if block in shell (and other?) languages seems like it functions this way too.

And I guess significant indentation also does this job, albeit with some of its own hazards.

I was under the impression that COBOL's English syntax was intended to be a more human-readable approach, not so much a workaround for character set limitations.
Maybe both? COBOL predates the first draft of ASCII by several years. Character sets were far from standardized in those days.
Lots (most?) of classic COBOL used EBCDIC[0]

[0] https://en.wikipedia.org/wiki/EBCDIC