by adastra22 1115 days ago
Those are syntactic sugar for the same thing though. Array[5] is just shorthand for *(Array + 5), which is why 5[Array] also works (because addition is commutative).
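
A minimal sketch of that equivalence, in case it helps. The function name `same_element` is made up for this example; all three expressions read the same element.

```c
#include <assert.h>
#include <stddef.h>

/* Three spellings of the same element access.
   same_element is a hypothetical name used only for illustration. */
int same_element(const int *array, size_t i)
{
    int a = array[i];      /* usual subscript form */
    int b = *(array + i);  /* what the subscript is defined to mean */
    int c = i[array];      /* legal because array + i == i + array */
    assert(a == b && b == c);
    return a;
}
```

Calling `same_element(v, 5)` on an array `v` with at least six elements returns `v[5]` whichever spelling is used.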

Note that C does have strong conventions, such as that strings are terminated by a zero byte. Nothing in the language demands that, it’s just a convention! C could adopt better conventions.

3 comments

> Note that C does have strong conventions, such as that strings are terminated by a zero byte

Stated the same on HN earlier, but someone pointed out that literal strings are ASCIIZ.

> literal strings are ASCIIZ.

If only. In C, it’s a (95+5)-item character set that happens to be a subset of ASCII. See https://en.cppreference.com/w/c/language/charset:

“The basic literal character set consists of all characters of the basic character set, plus the following control characters”

That page also explicitly says:

“The following characters are not in basic execution character set, but they are required to be encoded as a single byte in an ordinary character constant or ordinary string literal.”

  Code unit  Character      Glyph
  U+0024     Dollar Sign    $
  U+0040     Commercial At  @
  U+0060     Grave Accent   `

If I read that correctly, if you write a ‘$’ in a string literal before C23, there’s no guarantee that it gives you a byte with value 0x24.

Of course, C++ is different. Like C, it makes a distinction between the encoding of source files (nowadays called the “basic character set”) and the encoding that the compiler converts literals to (nowadays called the “basic literal character set”), but it seems to put even fewer restrictions on them (in my cursory reading).

Also (https://en.cppreference.com/w/cpp/language/charset):

“Mapping from source file (other than a UTF-8 source file) (since C++23) characters to the basic character set (until C++23) translation character set (since C++23) during translation phase 1 is implementation-defined, so an implementation is required to document how the basic source characters are represented in source files.”

If I understand that correctly, you can’t portably write a euro sign in C++ source files before C++23.

Also, chances are this changed in subtle ways between C and C++ versions.

One common trick in safer C libraries is to encode the length of the string one word prior to the beginning of the string. So "hello world" in memory would be

  11 'h' 'e' 'l' 'l' 'o' ' ' 'w' 'o' 'r' 'l' 'd' '\0'
  ptr ^

C could be upgraded to do this in future versions, without too much backwards incompatibility.
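
A rough sketch of that layout, under my own assumptions: the `lp_` names are hypothetical, the length word is a `size_t`, and the text keeps its '\0' so ordinary functions like puts() still work on it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helpers: a size_t length is stored immediately before
   the characters, and the returned pointer addresses the first character. */
char *lp_new(const char *src)
{
    size_t len = strlen(src);
    size_t *block = malloc(sizeof(size_t) + len + 1);
    if (block == NULL)
        return NULL;
    *block = len;                      /* hidden length word */
    char *text = (char *)(block + 1);  /* characters start right after it */
    memcpy(text, src, len + 1);        /* copy including the '\0' */
    return text;
}

/* O(1) length: read the word stored just before the text. */
size_t lp_len(const char *text)
{
    size_t len;
    assert(text != NULL);
    memcpy(&len, text - sizeof(size_t), sizeof len);
    return len;
}

/* Release a string created by lp_new. */
void lp_free(char *text)
{
    if (text != NULL)
        free(text - sizeof(size_t));
}
```

Note that `lp_len` is only valid on a pointer returned by `lp_new`. A pointer into the middle of the text has no length word in front of it, although strlen() still works there because the terminator is kept.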

From the C99 draft at https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf :

"A string is a contiguous sequence of characters terminated by and including the first null character. .. The length of a string is the number of bytes preceding the null character"

This means, for example, strlen() must always check for the location of the first null character - there's no advantage to checking the length.
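
For illustration, here is what that definition implies for a straightforward strlen(). This is a sketch with a made-up name (`my_strlen`), not how any particular libc implements it.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the bytes before the first null character, as the quoted
   definition requires. my_strlen is a hypothetical name. */
size_t my_strlen(const char *s)
{
    const char *p = s;
    assert(s != NULL);
    while (*p != '\0')
        p++;                 /* scan until the terminator */
    return (size_t)(p - s);  /* bytes preceding the null character */
}
```

Anything after the first null character is not part of the string, so `my_strlen("ab\0cd")` is 2.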

How would this work?

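  void *x = malloc(8);

  ...
  uint64_t i = 5216694956355289088; // 0x48656C6C6F210000; Python: int.from_bytes(b'Hello!\0\0', 'big')
  memcpy(x, &i, 8);
  char *s = x;
  puts(s);
Assuming I did it correctly, this should print "Hello!" on a big-endian machine. (On a little-endian machine the bytes land in reverse order, so the first byte is NUL and the string is empty.)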

When would the length get added to the start of the string?

> C could be upgraded to do this in future versions, without too much backwards incompatibility.

But I'd hope that doing that would always be optional. There are numerous situations where that would seriously get in the way.

Could you mention one of them?
Strings can point anywhere in the malloc'ed region:

  char buffer[] = "railroad";
  char *s = buffer;
  char *t = buffer + 4;
  printf("mult: %zu\n", strlen(s) * strlen(t));
Suppose I read 100 bytes, formatted as "{name}\t{rank}\t{serial number}\t" using variable length parts.

I can read the data into a single string buffer, replace the tabs with NULs, and set up strings pointing to the middle of the buffer:

   #include <stdio.h>
   #include <string.h>

   typedef struct {char buf[101]; char *name; char *rank; char *serialno;} person;

   /* 100 bytes formatted as: name\trank\tserial no\t. */
   int read_data(FILE *f, person *p) {
     char *s;
     if (fread(p->buf, 1, 100, f) != 100) return -1;
     p->buf[100] = 0;  /* terminate so strchr cannot run off the buffer */
     p->name = p->buf;
     if ((s = strchr(p->buf, '\t')) == NULL) return -2;
     *s = 0;
     p->rank = s+1;
     if ((s = strchr(s+1, '\t')) == NULL) return -2;
     *s = 0;
     p->serialno = s+1;
     if ((s = strchr(s+1, '\t')) == NULL) return -2;
     *s = 0;
     return 0;
   }

   person subject;
   if (read_data(stdin, &subject)) fail("cannot read.");
   printf("Hello %s %s.\n", subject.rank, subject.name);
   ...
Even better, the protocol might have NUL characters already in the code, expecting C strings to point to the correct start.
Sure. For instance, there are times when you need to pack strings tightly together. Adding an extra byte or two before the start of the string would get in the way. You could work around it in many cases, but it makes the code uglier and harder to understand/maintain.

One of the things that makes C particularly suitable for certain sorts of tasks is that it's mostly WYSIWYG when it comes to the relationship between data structures and the actual memory layout. Having "hidden" things like a length value before the string steps on that.

I agree on the first paragraph, but the second one applies poorly to strings:

  char *s = "hello";
"hello" takes up 6 bytes because there's a hidden \0 even if I never wrote it in the code.
If you wanted to pack strings together tightly, couldn't your string library have a separate "array" concept where all the sizes are stored separately?
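
One way that could look, as a sketch only: the `packed_strings` type and `packed_get` function are hypothetical names, and I'm assuming the sizes are kept as a table of end offsets next to the packed bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: string bytes packed back to back with no
   terminators, plus a separate table of end offsets. */
typedef struct {
    const char *bytes;   /* all characters, tightly packed */
    const size_t *ends;  /* ends[i] = offset one past the end of string i */
    size_t count;        /* number of strings */
} packed_strings;

/* Returns a pointer to string i and stores its length in *len.
   The result is NOT null-terminated; always use it with *len. */
const char *packed_get(const packed_strings *p, size_t i, size_t *len)
{
    assert(i < p->count);
    size_t start = (i == 0) ? 0 : p->ends[i - 1];
    *len = p->ends[i] - start;
    return p->bytes + start;
}
```

A slice obtained this way can be printed with `printf("%.*s", (int)len, s)`, but it cannot be handed to functions that expect a terminator.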
My copy of the C standard says "A string is a contiguous sequence of characters terminated by and including the first null character."
Many of the str functions in the C standard library assume a nul terminator.
Yes, but aside from string literals pointed out by a sibling comment, nothing in the language itself dictates this convention. The C library could be augmented with functions which expect strings structured in other ways.
> nothing in the language itself dictates this convention.

String literals are nul-terminated, e.g.: "foo"[3] == '\0'
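
A small check of that, assuming a C11 or later compiler for static_assert; the function name `literal_is_terminated` is made up for the example.

```c
#include <assert.h>
#include <string.h>

/* sizeof counts the terminator the compiler appends; strlen does not. */
static_assert(sizeof "foo" == 4, "three characters plus the terminator");

/* Returns nonzero if the literal carries its own null character. */
int literal_is_terminated(void)
{
    return "foo"[3] == '\0' && strlen("foo") == 3;
}
```

The static_assert is evaluated at compile time, so the terminator is part of the literal's array type rather than something a library adds later.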