by nine_k 975 days ago
One huge difference between C and Pascal grammars is that Pascal is LR(1), so it can be parsed easily, which helps one-pass translation. It also helps humans read it.

C, on the other hand, has needlessly complicated syntax; a function definition is hard to detect, and a pointer to a function is hard to interpret, because it's literally convoluted: https://c-faq.com/decl/spiral.anderson.html

Sadly, this is a general stylistic difference: where Pascal tries to go for clarity, C makes do with cleverness, which is more error-prone.

3 comments

You misremember. Pascal's grammar is "easy" because it is LL(1), not LR(1)!

C is almost LR(1), if we allow prior declarations to decide how some tokens are classified, like whether an identifier is a variable or type name.
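A minimal sketch of that dependence on prior declarations (the names T, x and both functions are made up for illustration): the same tokens "T * x" start a declaration where T currently names a type, and form a multiplication where T names a variable.

```c
/* In this function T names a type, so "T * x" declares a pointer. */
int as_declaration(void)
{
    typedef int T;
    int target = 42;
    T * x = &target;   /* x is a pointer to int */
    return *x;
}

/* In this function T names a variable, so "T * x" is a product. */
int as_expression(int T, int x)
{
    return T * x;
}
```

The parser cannot tell the two cases apart from the tokens alone; it has to know what T was declared as.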

Declarations like

  void (*signal(int, void (*fp)(int)))(int);
are LR(1).
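For anyone who finds that declaration hard to read, here is a sketch that spells it both ways. The names my_signal, handler_t and record are hypothetical, chosen so nothing clashes with the real <signal.h>. The compiler accepts the two declarations of my_signal only because they describe the same type.

```c
#include <stddef.h>

/* Hypothetical alias: pointer to a function taking int, returning void. */
typedef void (*handler_t)(int);

/* Spelled out the long way, same shape as the declaration quoted above. */
void (*my_signal(int sig, void (*fp)(int)))(int);

/* The same function again, this time through the typedef. */
handler_t my_signal(int sig, handler_t fp);

static handler_t current_handler = NULL;

/* Install fp and return whatever handler was installed before. */
handler_t my_signal(int sig, handler_t fp)
{
    (void)sig;   /* unused in this toy version */
    handler_t previous = current_handler;
    current_handler = fp;
    return previous;
}

static int last_seen = 0;

/* A sample handler that just remembers its argument. */
void record(int sig)
{
    last_seen = sig;
}
```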

LR(1) sentences are harder to read than LL(1) because you have to keep track of a long prefix of the input, looking for right reductions (if you follow certain LR algorithms). LR parsing algorithms use a stack which essentially provides unlimited lookahead, in comparison to LL(1). Both LL(1) and LR(1) have one symbol of lookahead, but qualitatively it's entirely different because the lookahead in LR is happening after an indefinitely long prefix of the sentence which has not been fully analyzed, and has been shunted into a stack, to be processed later. Many symbols can be pushed onto the stack before a decision is made to recognize a rule and reduce by it. Those pushed symbols represent a prefix of the input that is not yet reduced, while the reduction is happening on the right of that. So it is backwards in a sense; following what is going on in the grammar is a bit like understanding a stack language like Forth or PostScript.
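To make the Forth/PostScript comparison concrete, here is a sketch of a tiny postfix evaluator (eval_postfix is a made-up name). It is not an LR parser, only the analogy: operands pile up on a stack undecided, and they get combined only when an operator finally shows up on the right. It assumes single-digit operands and just the + and * operators.

```c
#include <ctype.h>

/* Evaluate a postfix string such as "2 3 4 * +". */
int eval_postfix(const char *text)
{
    int stack[64];
    int depth = 0;

    for (const char *p = text; *p != '\0'; p++) {
        if (isdigit((unsigned char)*p)) {
            /* Operand: push it and decide nothing yet. */
            if (depth < 64)
                stack[depth++] = *p - '0';
        } else if ((*p == '+' || *p == '*') && depth >= 2) {
            /* Operator on the right: combine the two topmost entries. */
            int rhs = stack[--depth];
            int lhs = stack[--depth];
            stack[depth++] = (*p == '+') ? lhs + rhs : lhs * rhs;
        }
        /* Any other character (e.g. a space) is skipped. */
    }
    return depth > 0 ? stack[depth - 1] : 0;
}
```

With "2 3 4 * +" all three numbers are on the stack before anything is combined, which is the "long unreduced prefix" effect described above.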

An LL(1) grammar allows sentences to be parsed in a left to right scan without pushing anything into a stack to reduce later. Everything is decidable based on looking at the next symbol. Under LL(1), by looking at one symbol, you know what you are parsing; each subsequent symbol narrows it down to something more specific. Importantly, the syntax of symbols that have been processed already (material to the left) is settled; their syntax is not left undecided while we recognize some fragment on the right.
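A rough sketch of what "decidable by the next symbol" looks like in code: a predictive recursive-descent evaluator for a toy expression grammar. The grammar and all function names here are my own assumptions, not Pascal's. Every function picks its rule by peeking at a single character, and what has been consumed is never revisited. There is no whitespace or error handling.

```c
#include <ctype.h>

/* The next unread character is the single symbol of lookahead. */
static const char *cursor;

static int parse_expr(void);

/* factor := '(' expr ')' | number */
static int parse_factor(void)
{
    if (*cursor == '(') {          /* one character decides the rule */
        cursor++;
        int inner = parse_expr();
        if (*cursor == ')')
            cursor++;
        return inner;
    }
    int value = 0;
    while (isdigit((unsigned char)*cursor))
        value = value * 10 + (*cursor++ - '0');
    return value;
}

/* term := factor { '*' factor } */
static int parse_term(void)
{
    int value = parse_factor();
    while (*cursor == '*') {
        cursor++;
        value *= parse_factor();
    }
    return value;
}

/* expr := term { ('+' | '-') term } */
static int parse_expr(void)
{
    int value = parse_term();
    while (*cursor == '+' || *cursor == '-') {
        char op = *cursor++;
        int rhs = parse_term();
        value = (op == '+') ? value + rhs : value - rhs;
    }
    return value;
}

/* Entry point: evaluate a string such as "2+3*4". */
int evaluate(const char *text)
{
    cursor = text;
    return parse_expr();
}
```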

Under LR(1) it's possible for a long sequence of symbols to belong to entirely unrelated phrase structures, only to be decided when something finally appears on the right. A LALR(1) parser generator outputs a machine in which the states end up shared by unrelated rules. The state transitions then effectively track multiple parallel contexts.

> C is almost LR(1),

Does that include the C preprocessor?

Somehow, I recall someone here (maybe it was user walterbright) suggesting that implementing a C preprocessor was a lot of work - maybe months - so one might consider using Facebook's MIT licensed preprocessor:

https://github.com/facebookresearch/CParser

The C preprocessor is a purely functional language![1]

[1] https://web.archive.org/web/20230714010215/http://conal.net/...

This is correct, thanks!
I'm very much not a C programmer, but I've never understood why it seems far more common to write `float *foo` instead of `float* foo`. The "pointerness" is part of the type and to me the latter expresses that far more clearly.
Because the syntax is:

  <specifiers> <declarator> {, <declarator>, ...} ;
The star is a type-deriving operator that is part of the <declarator>, not part of the <specifiers>!

This declares two pointers to char:

  char *foo, *bar;
This declares foo as a pointer to char, and bar as a char:

  char* foo, bar;
We have created a trompe l'oeil by separating the * from the declarator to which it belongs and attaching it to the specifier to which it doesn't.
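If you want the compiler to confirm this, a C11 sketch along these lines should do it (baz and qux are extra names I added so both declarations can live in one file). The static_asserts fail to compile if any variable has a different type than the comment claims.

```c
#include <assert.h>   /* static_assert (C11) */

char *foo, *bar;   /* two pointers to char */
char* baz, qux;    /* baz is a pointer to char, qux is a plain char */

/* _Generic selects on the type of its first operand at compile time. */
static_assert(_Generic(&foo, char **: 1, default: 0), "foo is char *");
static_assert(_Generic(&bar, char **: 1, default: 0), "bar is char *");
static_assert(_Generic(&baz, char **: 1, default: 0), "baz is char *");
static_assert(_Generic(&qux, char *: 1, default: 0), "qux is char");
static_assert(sizeof qux == 1, "qux occupies a single byte");
```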
On the other hand, for those of us who agree with the GP on this, one way around the pitfall is to have your project's style guide ban multiple declarations on a single line, or at least ban them for non-trivial variables— so `int x, y, z;` is permitted, but nothing more than that.
That's fine if it isn't used as a pretext for writing nonsense like char* p; which should likewise be banned in the same coding style document.
>This declares foo as a pointer to char, and bar as a char:

  char* foo, bar;
So that's why I've had so many problems understanding C. I come from the Pascal world, where a type specification is straightforward.
Because it isn't - `float* foo, bar;` foo is a pointer, bar is not.

(There were suggestions back in the 90s that to make C easier to parse for humans (and not-coincidentally simplify the compiler grammar) this should be `foo, bar: float*;` and your model of pointerness could actually be true. Never got much more traction than some "huh, that would be better, too bad we've been using this for 10 years already and will never change it" comments :-) (with an occasional side of "maybe use typedefs instead")

Good news (kinda): C23 allows (and GCC has for a long long time allowed) you to write typeof(float *) foo, bar; and declare two pointers. Not that I’d advocate writing normal declarations that way, but at least now you can write macros (e.g. for allocation) that don’t choke on arbitrary type names.
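A sketch of the macro use case, with a hypothetical DECLARE_PAIR macro. I'm using the GNU spelling __typeof__ so it also builds with GCC or Clang in pre-C23 modes; in C23 you can write plain typeof instead.

```c
#include <assert.h>   /* static_assert (C11) */

/* Hypothetical helper: declares two variables of exactly type T,
   even when T is a pointer or an array type. */
#define DECLARE_PAIR(T, a, b) __typeof__(T) a, b

DECLARE_PAIR(float *, first, second);   /* both are float * */
DECLARE_PAIR(int[3], left, right);      /* both are int[3]  */

static_assert(_Generic(&first, float **: 1, default: 0), "first is float *");
static_assert(_Generic(&second, float **: 1, default: 0), "second is float *");
static_assert(sizeof right == 3 * sizeof(int), "right is an array of 3 int");
```

Written without the typeof wrapper, the same macro would expand to "float * first, second" and the second variable would silently become a plain float.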
Which is why the convention is usually to not permit multiple declarations in one line.

If you value your codebase anyway.

Declaring X,Y and Z on separate lines for a graphics routine would just be silly, they're all the same type.

Defensive programming that extreme reminds me of the behavior I learned to avoid pissing off my drunk dad.

It's either that, or making a pointer type to bind things correctly. Convention is there to paper over shitty, shitty behavior in the language that is easy to trip yourself up on.
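For reference, the "make a pointer type" route looks roughly like this (float_ptr is a made-up name). Once the star lives inside a typedef, it applies to every declarator on the line.

```c
#include <assert.h>   /* static_assert (C11) */

typedef float *float_ptr;   /* hypothetical alias for "pointer to float" */

float_ptr x_ptr, y_ptr;     /* both are float * */

static_assert(_Generic(&x_ptr, float **: 1, default: 0), "x_ptr is float *");
static_assert(_Generic(&y_ptr, float **: 1, default: 0), "y_ptr is float *");
```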
I think it is a matter of preference, both "float* foo" and "float *foo" are widely used.

Personally i used "float *foo" for years until at some point i found "float* foo" more natural (as the pointer is conceptually part of the type) so i switched to that, which i've also been using for years. I've worked on a bunch of codebases which used both though (both in C and C++) - in some cases even mixed because that's what you get with a codebase where a ton of programmers worked over many years :-P.

I do tend to put pointer variable declarations on their own lines though regardless of asterisk placement.

(and of course there is always "float foo[42]" to annoy you with the whole "part of the type" aspect :-P)

Here's how I understand it.

One important (and beautiful) thing to understand about C is that declarations and use in C mirror each other.

Consider the same type written in Go and C: array of ten pointers to functions from int to int.

Go: var funcs [10]*func(int) int

C: int (*funcs[10])(int)

Go's version clearly reads left to right. The C version is ugly.

But the beautiful thing about the C version is that it mirrors how funcs can be used:

(*funcs[0])(5)

See how it's just like the declaration.

Go's version doesn't have this property.
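A small compilable sketch of the mirroring, with made-up functions add_one and twice filling the first two slots of the array:

```c
/* Two sample functions from int to int (hypothetical). */
int add_one(int n) { return n + 1; }
int twice(int n)   { return 2 * n; }

/* Array of ten pointers to functions from int to int. */
int (*funcs[10])(int) = { add_one, twice };

/* The call has the same shape as the declaration above. */
int call_first(int n)
{
    return (*funcs[0])(n);
}

/* C also accepts the shorthand without the explicit dereference. */
int call_second(int n)
{
    return funcs[1](n);
}
```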

So, now about the *.

Usage of * doesn't require spaces.

If p is a pointer to int, you use it like this: *p

And not like this: * p

And since type declarations follow usage, "int *p" makes more sense.

There is also a good argument about "int *p, i". In the end, these usages follow from how the C grammar works.

There are many more musings about that on the web, but here is one of my favourites: https://go.dev/blog/declaration-syntax.

Edit: HN formatting.

The * binds the name, not the type.

https://godbolt.org/z/GsoxrWdrG

For the same reason you don’t write x+y * z: because then the spacing contradicts the way the priorities work in the language.

We might wish for the C declaration syntax to be <type> <name>[, ...], but it’s not: it’s <specifier>[ ...] <declarator>[, ...], where int, long, unsigned, struct stat, union { uint64_t u; double d; }, and even typedef are all specifiers, and foo, (foo), (((foo))), *bar, baz[10], (*spam)(int), and even (*eggs)[STRIDE] are all declarators (the wisdom of using the last one is debatable, but it is genuinely useful if you can count on the future maintainer to know what it means).
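To see all of those declarators hang off a single specifier, here is a sketch that puts them in one declaration. STRIDE's value, the names foo2 and foo3, grid and first_of_row are my own assumptions. The static_asserts check the resulting types at compile time.

```c
#include <assert.h>   /* static_assert (C11) */

#define STRIDE 4   /* hypothetical row length, not from the thread */

/* One specifier (int), seven declarators. */
int foo, (foo2), (((foo3))), *bar, baz[10], (*spam)(int), (*eggs)[STRIDE];

static_assert(_Generic(&foo3, int *: 1, default: 0), "extra parentheses change nothing");
static_assert(_Generic(&bar, int **: 1, default: 0), "bar is a pointer to int");
static_assert(sizeof baz == 10 * sizeof(int), "baz is an array of 10 int");
static_assert(_Generic(&spam, int (**)(int): 1, default: 0), "spam is a pointer to function");
static_assert(_Generic(&eggs, int (**)[STRIDE]: 1, default: 0), "eggs is a pointer to array");

int grid[2][STRIDE] = { {1, 2, 3, 4}, {5, 6, 7, 8} };

/* eggs steps through grid one whole row (STRIDE ints) at a time. */
int first_of_row(int row)
{
    eggs = grid;
    return eggs[row][0];
}
```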

Everybody is free to not like the syntax, but actively misleading the reader about its workings seems counterproductive.

I swear if I see one more "int* x, y" example I will flip a keyboard. This isn't the 90s anymore. Everyone and their mother knows about this one pitfall, repeated for decades by people who have yet to read the memo that we now declare one variable per line, so it is never an issue. Even if you make this mistake, it's the least of your worries when coding in C because the compiler will typically warn you when you try to use the integer as a pointer.

Get with the program: types on the left, names on the right, one declaration per line.

Humans can normally treat C as if it is LR and get away with it, except for a few places that they can often recognize and avoid. It is a bad shortcut, but still one that you can get by with taking in a lot of code.
Everywhere but typedef can be made LR(1) IIRC
Sure, but "get away with" is not something to strive for in a programming language. I've always hated C for the unnecessary brainpower it sometimes takes to parse a construct.