


	by astrobe_ 1116 days ago
	> Note that C does have strong conventions, such as that strings are terminated by a zero byte

	Stated the same on HN earlier, but someone pointed out that literal strings are ASCIIZ.

2 comments

Someone 1116 days ago

> literal strings are ASCIIZ.

If only. In C, it’s a (95+5)-item character set that happens to be a subset of ASCII. See https://en.cppreference.com/w/c/language/charset:

“The basic literal character set consists of all characters of the basic character set, plus the following control characters”

That page also explicitly says:

“The following characters are not in basic execution character set, but they are required to be encoded as a single byte in an ordinary character constant or ordinary string literal.

  Code unit  Character      Glyph
  U+0024     Dollar Sign    $
  U+0040     Commercial At  @
  U+0060     Grave Accent   `”

If I read that correctly, if you write a ‘$’ in a string literal before C23, there’s no guarantee that it gives you a byte with value 0x24.
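A quick way to see what a particular compiler does with these three characters is to compare the literal values against their ASCII code points. This is only a sketch: the helper name `literals_match_ascii` is made up for illustration, and a result of 1 describes the implementation you compiled with, not what the standard guarantees.

```c
#include <stdbool.h>

/* Hypothetical helper: reports whether this implementation encodes
   '$', '@' and '`' in character constants with their ASCII values. */
bool literals_match_ascii(void)
{
    return '$' == 0x24 && '@' == 0x40 && '`' == 0x60;
}
```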
Of course, C++ is different. Like C, it makes a distinction between the encoding of source files (nowadays called the “basic character set”) and the encoding that the compiler converts literals to (nowadays called the “basic literal character set”), but it seems to put even fewer restrictions on them (in my cursory reading).
Also (https://en.cppreference.com/w/cpp/language/charset):

“Mapping from source file (other than a UTF-8 source file) (since C++23) characters to the basic character set (until C++23) translation character set (since C++23) during translation phase 1 is implementation-defined, so an implementation is required to document how the basic source characters are represented in source files.”

If I understand that correctly, you can’t portably write a euro sign in C++ source files before C++23.

Also, chances are this changed in subtle ways between C and C++ versions.


adastra22 1116 days ago

One common trick in safer C libraries is to encode the length of the string one word prior to the beginning of the string. So "hello world" in memory would be

  11 'h' 'e' 'l' 'l' 'o' ' ' 'w' 'o' 'r' 'l' 'd' '\0'
      ^
      ptr

C could be upgraded to do this in future versions, without too much backwards incompatibility.
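A minimal sketch of that layout, assuming the prefix is a `size_t` stored immediately before the characters. The names `lp_new`, `lp_len` and `lp_free` are invented for this example and are not taken from any particular library. The returned pointer still points at a NUL-terminated string, so it can be passed to ordinary C functions:

```c
#include <stdlib.h>
#include <string.h>

/* Allocate [size_t length][chars...]['\0'] and return a pointer to the chars. */
char *lp_new(const char *src)
{
    size_t len = strlen(src);
    size_t *block = malloc(sizeof(size_t) + len + 1);
    if (block == NULL) return NULL;
    *block = len;                 /* length lives one word before the string */
    char *s = (char *)(block + 1);
    memcpy(s, src, len + 1);      /* copy the characters and the terminator */
    return s;
}

/* O(1) length: read the word stored just before the first character. */
size_t lp_len(const char *s)
{
    size_t len;
    memcpy(&len, s - sizeof(size_t), sizeof len);
    return len;
}

/* Free the whole block, starting at the hidden prefix. */
void lp_free(char *s)
{
    if (s != NULL) free(s - sizeof(size_t));
}
```

`lp_len(s)` is only valid for pointers that came from `lp_new`; a pointer into the middle of a buffer, or a plain `char[]`, has no prefix to read.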


eesmith 1116 days ago

From the C99 draft at https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf :

"A string is a contiguous sequence of characters terminated by and including the first null character. .. The length of a string is the number of bytes preceding the null character"

This means, for example, strlen() must always scan for the first null character; a stored length gives it no advantage.
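Under that definition, a conforming length function has to behave like a scan for the first null byte. A simplified sketch (real implementations are heavily optimized, and `scan_len` is a made-up name):

```c
#include <stddef.h>

/* Count bytes up to, but not including, the first null character. */
size_t scan_len(const char *s)
{
    const char *p = s;
    while (*p != '\0') p++;
    return (size_t)(p - s);
}
```

For example, `scan_len("ab\0cd")` is 2: everything after the first null byte is not part of the string.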

How would this work?

  void *x = malloc(8);

   ...
  uint64_t i = 0x0000216F6C6C6548; // Python: int.from_bytes(b'Hello!\0\0', 'little')
  memcpy(x, &i, 8);
  char *s = x;
  puts(s);

Assuming I did it correctly, this should print "Hello!" on a little-endian machine.

When would the length get added to the start of the string?


JohnFen 1116 days ago

> C could be upgraded to do this in future versions, without too much backwards incompatibility.

But I'd hope that doing that would always be optional. There are numerous situations where that would seriously get in the way.


ranger_danger 1116 days ago

Could you mention one of them?


eesmith 1116 days ago

Strings can point anywhere within a buffer, not just at its start:

  char buffer[] = "railroad";
  char *s = buffer;
  char *t = buffer + 4;
  printf("mult: %zu\n", strlen(s) * strlen(t));

Suppose I read 100 bytes, formatted as "{name}\t{rank}\t{serial number}\t" using variable length parts.

I can read the data into a single string buffer, replace the tabs with NULs, and set up strings pointing to the middle of the buffer:

   #include <stdio.h>
   #include <string.h>

   typedef struct {char buf[101]; char *name; char *rank; char *serialno;} person;

   /* 100 bytes formatted as: name\trank\tserial no\t. */
   int read_data(FILE *f, person *p) {
     char *s;
     if (fread(p->buf, 1, 100, f) != 100) return -1;
     p->buf[100] = 0;  /* terminate so strchr() cannot run past the buffer */
     p->name = p->buf;
     if ((s = strchr(p->buf, '\t')) == NULL) return -2;
     *s = 0;           /* end of name */
     p->rank = s+1;
     if ((s = strchr(s+1, '\t')) == NULL) return -2;
     *s = 0;           /* end of rank */
     p->serialno = s+1;
     if ((s = strchr(s+1, '\t')) == NULL) return -2;
     *s = 0;           /* end of serial number */
     return 0;
   }

   person subject;
   if (read_data(stdin, &subject)) fail("cannot read.");
   printf("Hello %s %s.\n", subject.rank, subject.name);
   ...

Even better, the protocol might have NUL characters already in the data, expecting C strings to point to the correct start.


JohnFen 1116 days ago

Sure. For instance, there are times when you need to pack strings tightly together. Adding an extra byte or two before the start of the string would get in the way. You could work around it in many cases, but it makes the code uglier and harder to understand/maintain.

One of the things that makes C particularly suitable for certain sorts of tasks is that it's mostly WYSIWYG when it comes to the relationship between data structures and the actual memory layout. Having "hidden" things like a length value before the string steps on that.
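One example of the tight packing mentioned above is a block of strings stored back to back, with an empty string marking the end. A sketch, with the invented helper `count_packed`:

```c
#include <stddef.h>
#include <string.h>

/* Count the strings in a block laid out as "a\0b\0c\0\0":
   each string follows the previous terminator directly, and an
   empty string marks the end of the block. */
size_t count_packed(const char *block)
{
    size_t n = 0;
    for (const char *p = block; *p != '\0'; p += strlen(p) + 1)
        n++;
    return n;
}
```

With a hidden length word in front of every string, `p + strlen(p) + 1` would no longer land on the start of the next string.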


teo_zero 1116 days ago

I agree on the first paragraph, but the second one applies poorly to strings:

  char *s = "hello";

"hello" occupies 6 bytes because there's a hidden \0 even though I never wrote it in the code.
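The hidden terminator can be observed directly: `sizeof` counts it, `strlen` does not. A small sketch (the function name is illustrative):

```c
#include <stddef.h>
#include <string.h>

/* Returns how many bytes the array occupies beyond its strlen(). */
size_t hidden_bytes(void)
{
    static const char hello[] = "hello";   /* array of 6 chars: h e l l o \0 */
    return sizeof hello - strlen(hello);    /* 6 - 5 */
}
```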


ranger_danger 1115 days ago

If you wanted to pack strings together tightly, couldn't your string library have a separate "array" concept where all the sizes are stored separately?
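One way to read that suggestion, sketched with invented names (`packed_strings`, `packed_get`): the characters are stored back to back with no terminators or prefixes, and a parallel array records where each string ends.

```c
#include <stddef.h>

/* Characters stored back to back; ends[i] is the offset one past string i. */
typedef struct {
    const char   *data;   /* e.g. "JohnSgt12345" */
    const size_t *ends;   /* e.g. {4, 7, 12} */
    size_t        count;
} packed_strings;

/* Return a pointer to string i and store its length in *len.
   The result is NOT NUL-terminated, so callers must use the length. */
const char *packed_get(const packed_strings *p, size_t i, size_t *len)
{
    size_t start = (i == 0) ? 0 : p->ends[i - 1];
    *len = p->ends[i] - start;
    return p->data + start;
}
```

Slices returned this way can be printed with `printf("%.*s", (int)len, s)`, but they cannot be handed to functions that expect a terminating NUL.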
