by jcranmer 2724 days ago
Here's an easy way to understand how these things work: in C, the type of a pointer/function/array mess is declared by how it's used. For a declaration like "int ( * ( * foo)(void))[3]", you can read it as "for a variable foo, after computing the expression ( * ( * foo)(void))[3], the result is an int."

So one way to read C "gibberish" is to ignore the type at the beginning and parse the rest as an expression, like a normal parse tree. First we take foo. Then we dereference it (so foo is a pointer). Next we call it as a function with no arguments (so foo is a pointer to a function that takes no arguments). Next, we dereference it again. Then we index into the result as an array. Finally, we reach the end, so we look at the declared type and find that it is an int. So foo is a pointer to a function that takes no arguments and returns a pointer to an array of 3 ints.
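The reading above can be checked with a small C sketch. Only `foo` and its declaration come from the comment; `storage`, `get_storage` and `middle_element` are made-up names for illustration.

```c
/* Illustrative names: storage, get_storage and middle_element are not from the thread. */
static int storage[3] = {10, 20, 30};

/* A function that takes no arguments and returns a pointer to an array of 3 ints. */
static int (*get_storage(void))[3]
{
    return &storage;
}

/* The declaration from the comment: foo is a pointer to such a function. */
int (*(*foo)(void))[3] = get_storage;

/* Use mirrors declaration: dereference foo, call it, dereference the
   result, index it, and what comes out is an int. */
int middle_element(void)
{
    return (*(*foo)())[1];
}
```

The expression inside `middle_element` has the same shape as the declarator, with an index in place of the array size.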

You can also use this to go backwards. What's the syntax for a function that takes an integer argument and returns a pointer to an array of function pointers taking no arguments and returning integers? Well, we want to take foo, call it, dereference it, then index into an array, then dereference it again, then call it again, then return an int. Or int (* (* (foo)(int))[5])(void).
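Going backwards can be sketched the same way. Here the declaration from the comment is turned into a function definition; `one`, `two`, `table`, `unused` and `call_slot` are hypothetical names added for the example.

```c
/* Illustrative names: one, two, table and call_slot are not from the thread. */
static int one(void) { return 1; }
static int two(void) { return 2; }

/* An array of 5 pointers to functions taking no arguments and returning int. */
static int (*table[5])(void) = {one, two, one, two, one};

/* The declaration from the comment, written as a function definition:
   foo takes an int and returns a pointer to that kind of array. */
int (*(*(foo)(int unused))[5])(void)
{
    (void)unused;
    return &table;
}

/* Use mirrors declaration: call foo, dereference, index, dereference, call. */
int call_slot(int i)
{
    return (*(*(foo)(0))[i])();
}
```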


How about just using Ada? It has the added bonus of not being a gimmick (depending on who you ask I suppose ; )

Ada: type Ret_Typ is array (1..3) of Integer; Foo : access function return not null access Ret_Typ := null;

C: int (*(*foo)(const void *))[3]

Cdecl: declare foo as pointer to function (pointer to const void) returning pointer to array 3 of int

In the interest of furthering annoying language smuggery, the rough Rust equivalent:

    foo: fn() -> Box<[i32; 3]>
Alternately, if the pointer is into static memory and not something allocated on the heap:

    foo: fn() -> &'static [i32; 3];
That's pretty nice to look at and not too hard to read. In my opinion, for commonly used syntax (like fn decls), some well-chosen punctuation marks (', ->, :, in this case) are often a boon to readability compared to keywords. So I think the Rust syntax in this case is nicer than Ada's.

But in any case, while complicated C declarations may be uglier and take more effort to read than those in other languages, they are at least tractable once you learn the trick of "declaration follows use" and working backwards as GP describes.

Separately, though, what do you mean by your "gimmick" comment?

> Separately, though, what do you mean by your "gimmick" comment?

Just meaning the website CDecl - it's a neat tool to make C readable in English, but Ada is a real language used to make planes fly etc. Many people have contempt for it though, which is why I made the joke xD

Interesting that you can note which type of memory an anonymous type comes from in Rust. I suppose it's for optimization purposes? Doesn't seem that helpful from a pure typing perspective.

As an aside I doubt a layman would be able to understand that notation in Rust, whereas my girlfriend might be able to grasp or read Ada code or the output of CDecl.

> Alternately, if the pointer is into static memory and not something allocated on the heap

There's really no need to box a 12-byte array in the first place.

I'm not defending C's syntax as sane here, because it's not. It boils down to two problems:

1. The syntax isn't "type id, id, id;", it's "type expr, expr, expr;" The trend for C-style languages has been to move to the former type syntax, so C/C++ is the anomaly here.

2. Pointer declarators show up to the left of the name while function and array declarators show up to the right of the name. This means you can't figure out the type by scanning in one direction. Contrast this with LLVM, where function arguments and pointer types both go to the right of the leaf type (while arrays are infix), or Rust, where they both live on the left of the leaf type.
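A short sketch of both problems in one line of C. The variable names and the `IS_*` macros are invented for the example, and `_Generic` assumes a C11 compiler.

```c
/* One declaration, four different types: the leading `int` is shared,
   but `*`, `[3]` and `(void)` belong to the individual declarators. */
int *p, q, r[3], (*f)(void);

/* C11 _Generic picks a branch by the static type of its operand. */
#define IS_INT(x)     _Generic((x), int: 1, default: 0)
#define IS_INT_PTR(x) _Generic((x), int *: 1, default: 0)
#define IS_FN_PTR(x)  _Generic((x), int (*)(void): 1, default: 0)
```

Only `p` is a pointer to int; `q` is a plain int, `r` is an array of 3 ints, and `f` is a pointer to a function. To work out the type of `f` you have to read the `*` to its left and the `(void)` to its right.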

Reading from both ends is maddening for seasoned devs, but in a general sense most languages are symbol salads these days for arbitrary, subjective reasons. The decisions made during the C development to squeeze the juice out of 60 character wide terminals haunt us to this day... like case sensitivity

> like case sensitivity

What's wrong with case sensitivity?

Sure, the current doctrine of programming says it's good, but let's take a step back. First of all, it's counterintuitive compared to writing English, and it damages readability, so you may have naming conflicts without realizing it and without the compiler being able to warn you.

Also, in addition to remembering what a function is called, you have to remember its casing, which different libraries may spell differently (like C libraries versus C++ libraries).

Bypassing the very real issues it may cause in the design of your software it also may lead to silly library cruft (see Java's Color class) - how many ways can you spell blue?

https://docs.oracle.com/javase/7/docs/api/java/awt/Color.htm...

Case-sensitivity is, more or less, a default: you have different strings that have distinct encodings, so you treat them as different identifiers.

The alternative to case-sensitivity requires your compiler to know about case, and, more importantly, how to do case-folding. At that point, you can either choose to (a) restrict identifiers to some (probably ASCII) limited subset of characters, (b) only make some subset of acceptable characters (reliably) case-insensitive, or (c) require every compiler to have tables for case-folding.

That's before we get into the locale-dependence of case-folding, which makes the letter "i" unreliable.

And you still have to distinguish Color and Colour.

Case insensitivity sounds good except it quickly runs afoul of "language isn't so simple."

If I define a variable as "groß", does "GROSS" or "GROẞ" match it (or both, which probably implies "gross" would match as well)? What about "ê" and "E"? Or the infamous i/I/İ/ı debacle, which could make matching "insane" to "INSANE" locale-dependent? How do you define case-insensitivity in a way that makes sense?
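For what it's worth, the restricted approach (fold only ASCII letters, compare everything else byte for byte) is easy to sketch in C. This is only an illustration of that one option, not how any particular compiler does it; `ascii_fold` and `ident_equal_ascii` are hypothetical names. It deliberately avoids `tolower`, whose behavior depends on the current locale.

```c
#include <stdbool.h>

/* Fold only the 26 ASCII capital letters; every other byte is left alone.
   This is ASCII-only folding, not Unicode case folding. */
static unsigned char ascii_fold(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c - 'A' + 'a') : c;
}

/* Hypothetical helper: compare two identifiers, ignoring ASCII case only. */
bool ident_equal_ascii(const char *a, const char *b)
{
    while (*a && *b) {
        if (ascii_fold((unsigned char)*a) != ascii_fold((unsigned char)*b))
            return false;
        a++;
        b++;
    }
    return *a == *b;   /* both strings must end at the same point */
}
```

Under this rule "insane" matches "INSANE" in every locale, but a UTF-8 encoded "groß" does not match "GROSS", and "Color" still differs from "Colour". It sidesteps the hard questions above by answering them all with "no match".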

The other important part is to remember that declarators follow the same precedence as ordinary expressions, i.e. array subscripting and function call have higher precedence than pointer dereference.
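A few declarations that differ only in parentheses show the precedence rule at work (the names `a`, `b`, `f` and `g` are arbitrary):

```c
/* [] and () bind tighter than *, in declarators as in expressions. */
int *a[3];        /* indexed first, then dereferenced: array of 3 pointers to int */
int (*b)[3];      /* dereferenced first, then indexed: pointer to array of 3 ints */
int *f(void);     /* called first, then dereferenced: function returning pointer to int */
int (*g)(void);   /* dereferenced first, then called: pointer to function returning int */
```

The parentheses do the same job they do in an expression: they force the dereference to happen before the subscript or the call.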

It is notable that the chapter in K&R which discusses declarations also presents a partial version of the cdecl program, and one of the exercises is for the reader to complete it, which really helps to dispel the notion that compilers are mysterious magic. In my experience, it's rare for an introductory book on a programming language to also contain such "hints" on how it could be implemented.
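To give a flavor of how small such a program can be, here is a rough recursive-descent sketch in the same spirit. It is not the K&R code: all names are made up, the base type must be a single word such as "int", qualifiers like const are not handled, and parameter lists are skipped rather than parsed.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* All names here are hypothetical; this is not the K&R program itself. */
static const char *cur;     /* cursor into the declaration text */
static char name[64];       /* the declared identifier */
static char out[256];       /* English description, built inside-out */

static void skip_ws(void) { while (isspace((unsigned char)*cur)) cur++; }

static void append(const char *s)
{
    strncat(out, s, sizeof out - strlen(out) - 1);
}

static void declarator(void);

/* direct-declarator: name | ( declarator ), followed by any () or [n] */
static void direct_declarator(void)
{
    skip_ws();
    if (*cur == '(') {                     /* grouping parentheses */
        cur++;
        declarator();
        skip_ws();
        if (*cur == ')') cur++;
    } else {                               /* the identifier itself */
        size_t n = 0;
        while ((isalnum((unsigned char)*cur) || *cur == '_') && n < sizeof name - 1)
            name[n++] = *cur++;
        name[n] = '\0';
    }
    for (;;) {                             /* suffixes bind tighter than * */
        skip_ws();
        if (*cur == '(') {                 /* parameter list: skipped, not parsed */
            int depth = 0;
            do {
                if (*cur == '(') depth++;
                else if (*cur == ')') depth--;
                cur++;
            } while (depth > 0 && *cur);
            append(" function returning");
        } else if (*cur == '[') {          /* array bound is copied verbatim */
            append(" array");
            while (*cur && *cur != ']') {
                char c[2] = { *cur++, '\0' };
                append(c);
            }
            if (*cur == ']') { append("]"); cur++; }
            append(" of");
        } else {
            break;
        }
    }
}

/* declarator: optional *s, then a direct-declarator */
static void declarator(void)
{
    int stars = 0;
    skip_ws();
    while (*cur == '*') { stars++; cur++; skip_ws(); }
    direct_declarator();
    while (stars-- > 0) append(" pointer to");
}

/* Describe a declaration whose type is a single word such as "int". */
const char *describe(const char *decl)
{
    static char result[400];
    char type[64];
    size_t n = 0;

    cur = decl;
    name[0] = out[0] = '\0';
    skip_ws();
    while ((isalnum((unsigned char)*cur) || *cur == '_') && n < sizeof type - 1)
        type[n++] = *cur++;
    type[n] = '\0';
    declarator();
    snprintf(result, sizeof result, "%s:%s %s", name, out, type);
    return result;
}
```

Calling `describe("int (*(*foo)(void))[3]")` walks the declarator in the same inside-out order a human reader would: the name first, then the suffixes to its right, then the stars to its left, one level of parentheses at a time.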

However the declaration-mirrors-use idea does not apply to function arguments. If you have "void (* f)(int * arg)", you would not use it like "(* f)(* arg)" unless your arg is actually "int * * ".
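A minimal sketch of that asymmetry, with made-up names (`set_to_seven`, `demo`) around the `f` from the comment:

```c
/* Illustrative names: set_to_seven and demo are not from the thread. */
static void set_to_seven(int *arg)
{
    *arg = 7;               /* inside the callee, *arg is an int */
}

/* The declaration from the comment, pointed at a real function. */
void (*f)(int *arg) = set_to_seven;

int demo(void)
{
    int y = 0;
    (*f)(&y);               /* the call site passes an address, not *arg */
    return y;
}
```

The `(* f)` part of the call mirrors the declaration, but the argument does not: the declaration says `int * arg` while the call says `&y`.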

This could be fixed. Instead of "void (* f)(int * x)" we would write "void (* f)(x &int)". Now it makes sense, the declaration says that we could call the function if we pass the address of some int y, as if by "(* f)(&y)". The specific syntax "x &int" says that the address of an int is x, the same way as "int * x" says that dereferenced x is an int.

What about "void (* f)(int x[10])" (pretending arrays could actually be passed)? With the pointer we relied on the existing opposite of the dereference operator, but there is nothing like that for arrays, that would make an array out of an element. Let's look to Python for inspiration, where the expression "[y]* N" will make a list of N elements with the value y. This gives us: "void (* f)(x [int]* N)". See how the declaration tells us that we could call the function using "(* f)([y]* N)" for some int y.

There's one more we need to solve: "void (* f)(void (* g)(int))". Since the parameter g of * f is a function pointer, we need to pass the address of a function, so clearly & will be involved. But we need a function to take the address of, and we don't have any available. Inspired by the C++ lambda syntax, let's invent function conjuration: "(Args) -> Ret" is an expression that conjures a function taking Args and returning Ret. Hence the solution: "void (* f)(g &(int) -> void)". It says that you could write "(* f)(&(int) -> void)", to call * f with the address of a conjured function taking an int and returning void.

We do need to be aware that the syntax for arguments in function conjuration expressions is the same as in top-level declarations. So we would need to rewrite "void (* f)(void (* g)(void (* h)(int * x)))" as "void (* f)(g &(void (* h)(x &int)) -> void)". So for each function pointer, its arguments must be declared in the other declaration mode.

Since this makes no sense at all, we have to conclude that the original C declaration syntax forms need to be deprecated and only the newly invented syntax forms should be used.

  x &int;   (int * x)
  x &&int;   (int * * x)
  f &(x &int) -> void;   (void (* f)(int * x))
  f &(x [int]* 10) -> void;   (void (* f)(int x[10]))
The new syntax can also be used for function declarations:

  main (argc int, argv [&char]*?) -> int
  {
      return 0;
  }
See how we've invented a different declaration syntax (some sort of dual of C's current syntax), that actually respects "declaration-mirrors-use" better than C does and makes much more sense to humans.

1) The use of the Python feature for arrays I find confusing as it is not orthogonal to the rest of your new and improved syntax for C.

Everywhere else, you change C's declaration order of <declaration-specifier> <declarator>, in your new syntax to place the identifier of the declarator first, followed by any pointer ops, and lastly the type. You are changing the pointer op "" from a prefix that needed to be read right-to-left, after locating the identifier of the declarator, into a suffix "&" following the identifier, to be read left-to-right.

I agree that your change to left-to-right declaration order is definitely more readable.

2) But in your array syntax, borrowed from Python, the type is placed inside the array brackets, which used to hold the constant-expression denoting the array size. The array size is moved from within the brackets to be last, instead of the type being last, as in all your other syntax "rules". So, for arrays, the declaration syntax no longer reads simply left-to-right, since type is between declarator identifier and array size.

Wouldn't this be clearer, to have the type last and the constant-expression remain inside the array brackets? C syntax: (void ( f)(int x[10]))

use this instead for your new C syntax: f &(x [10] int) -> void;

3) I have a similar problem with your function syntax:

instead of:

main (argc int, argv [&char]*?) -> int { return 0; }

why not put the type last, so as to be consistent with all your other syntax?

main (argc int, argv [] &char) -> int { return 0; }

This is how the Go programming language does it, except for the preceding "func" reserved word and "string" in place of pointer to char: func main(argc int, argv [] string) int ...

5) The biggest problem I have is with adding "C++ lambda syntax" to C, to solve the problem of passing a function as actual parameter argument. That would mean you have 2 styles of pointers, one as a prefix and one as a suffix to the declarator identifier. So you now have to read both right-to-left and left-to-right, which seems to cancel out the benefits of only reading declarations in left-to-right order!

Would it be simpler, and preserve left-to-right declaration order, to provide a FunctionType as in the Go programming language? A parameter that is passed a function as argument is declared to have a FunctionType. Pointers to function are not apparently needed, at least not at the user level.

6) Q: How do these proposed changes affect the parsing of the new C syntax? Current C syntax can be parsed with predictive, non-backtracking parsers, in linear time. I don't want to use backtracking, GLR, or other complex methods, if they are avoidable. At least C can now be parsed with Yacc or Bison. (See A13 Grammar in K&R, "The C Programming Language" or Jacques-Henri Jourdan, François Pottier "A Simple, Possibly Correct LR Parser for C11")

For the arrays, I agree "[10] int" is better.

For functions, I think the -> syntax is the only thing that makes sense. It's just natural, first you need the arguments then you get the return value.

> That would mean you have 2 styles of pointers, one as a prefix and one as a suffix to the declarator identifier. So you now have to read both right-to-left and left-to-right, which seems to cancel out the benefits of only reading declarations in left-to-right order!

I'm not following. There are not two styles of pointers, a pointer is declared like "&type". Functions are declared like "(args) -> ret" which is read left-to-right (function taking such arguments and returning such value). A function pointer is simply a pointer to a function like "&(args) -> ret".

> Q: How do these proposed changes affect the parsing of the new C syntax?

I guess I should have added </sarcasm>? C would never adopt such a radical change. In any case, I don't see how it would be fundamentally more difficult to parse than the current declaration syntax. There would be problems disambiguating the two (what is "foo bar;" if foo and bar are both typedefs?). Maybe changing to require a colon after the name would make that simpler "fun: (arg: int) -> int".

By 2 styles of pointers used in functions, see below taken from your examples.

By reading both left-to-right and right-to-left I mean: "star" pointer in front of the identifier read right-to-left with return function type on the left and ampersand pointer read left-to-right with return function type on the right of "->"

Here are 2 of your function examples you gave:

This example has both an outermost "void" function return type on the left and another to the right of "->": "void (* f)(g &(int) -> void)". Instead, use the following to always read left-to-right and get rid of the "star" pointer: "f &(g &(int) -> void) -> void".

Your next example has 2 "void" function return types on the left, and one "void" to the right of "->": "void (* f)(g &(void (* h)(x &int)) -> void)". Instead, use the following to always read left-to-right: "f &(g &(h &(x &int) -> void) -> void) -> void".

I did say "original C declaration syntax forms need to be deprecated and only the newly invented syntax forms should be used". So the new invented syntax alone is complete (you can express anything with &, (args)->ret and [count]type).

Ok, I got it now. Thanks. My faulty understanding.

I believe you have a clean, readable syntax for C declarations.

The Go Programming Language's declaration syntax is very similar, except that "star" is retained and acts just as your "&".

I don't have a complete list, but have been looking at classifying programming languages into one of 2 categories:

Category 1 declaration syntax: Type identifier ;

or

Category 2 declaration syntax: identifier Type ; with perhaps a colon or other punctuation between the identifier and Type.

I did something wrong and the asterisk or star, representing C's pointer op, has been dropped from my prior posting. I apologize, also for poor formatting.

*An array of 4 ints

Great explanation though, it really helps to read things inside-out

Which part are you correcting?

The declaration "int bar[3];" is an array of 3 ints, which are bar[0], bar[1] and bar[2]. Declaration mimics use but it's not exactly the same; in this case the size replaces the indices, which are all less than it.
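The size-versus-index distinction can be checked directly. The `COUNT_OF` macro and `last_index_of_bar` are hypothetical helpers for the example; only `bar` comes from the comment.

```c
#include <stddef.h>

int bar[3];   /* the 3 is the element count, not the highest index */

/* Number of elements in a true array (this does not work on a pointer). */
#define COUNT_OF(arr) (sizeof (arr) / sizeof (arr)[0])

/* Returns the last valid index of bar, which is COUNT_OF(bar) - 1. */
size_t last_index_of_bar(void)
{
    return COUNT_OF(bar) - 1;
}
```

So `bar` has 3 elements and its last valid index is 2; `bar[3]` would be one past the end.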