by adastra22 1115 days ago
One common trick in safer C libraries is to encode the length of the string one word prior to the beginning of the string. So "hello world" in memory would be

11 'h' 'e' 'l' 'l' 'o' ' ' 'w' 'o' 'r' 'l' 'd' '\0'

ptr ^

C could be upgraded to do this in future versions, without too much backwards incompatibility.
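For anyone who hasn't seen the trick, here is a minimal sketch of that layout. It assumes the "word" is a size_t, and the names lp_new, lp_len and lp_free are made up for illustration, not taken from any real library. The returned pointer still works with anything that expects a NUL-terminated string.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: allocate [size_t length][chars...]['\0'] and return
   a pointer to the first character. */
char *lp_new(const char *src) {
    assert(src != NULL);
    size_t len = strlen(src);
    size_t *block = malloc(sizeof(size_t) + len + 1);
    if (block == NULL) return NULL;
    *block = len;                    /* length lives one word before the text */
    char *text = (char *)(block + 1);
    memcpy(text, src, len + 1);      /* copy the characters and the '\0' */
    return text;
}

/* O(1) length: read the word stored just before the text.
   Only valid for pointers returned by lp_new. */
size_t lp_len(const char *text) {
    size_t len;
    memcpy(&len, text - sizeof(size_t), sizeof len);
    return len;
}

/* Free through the start of the allocation, not the text pointer. */
void lp_free(char *text) {
    if (text != NULL) free(text - sizeof(size_t));
}
```

Note that lp_len is undefined for any pointer that did not come from lp_new, which is where the compatibility questions start.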


From the C99 draft at https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf :

"A string is a contiguous sequence of characters terminated by and including the first null character. .. The length of a string is the number of bytes preceding the null character"

This means, for example, that strlen() must always scan for the first null character, so a stored length gives it no advantage.
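To make that concrete, here is a rough sketch of the definition as code. scan_len and truncate_at are names I made up for illustration. A single plain store into the middle of the buffer changes the length, and nothing would have updated a length stored elsewhere.

```c
#include <assert.h>
#include <stddef.h>

/* Length as the standard defines it: count bytes up to the first '\0'. */
size_t scan_len(const char *s) {
    assert(s != NULL);
    const char *p = s;
    while (*p != '\0') p++;
    return (size_t)(p - s);
}

/* Hypothetical demonstration: writing a '\0' at index n shortens the string
   to n bytes (assuming n is within the current length). */
size_t truncate_at(char *s, size_t n) {
    s[n] = '\0';
    return scan_len(s);
}
```

With char buf[] = "hello world", scan_len(buf) is 11, and after truncate_at(buf, 5) it is 5.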

How would this work?

  void *x = malloc(8);

   ...
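  /* 'H' 'e' 'l' 'l' 'o' '!' '\0' '\0' packed as a little-endian integer.
     Python: int.from_bytes(b'Hello!\0\0', 'little') */
  uint64_t i = 0x0000216f6c6c6548;
  memcpy(x, &i, 8);
  char *s = x;
  puts(s);
Assuming I did it correctly, this should print "Hello!" on a little-endian machine.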

When does the length get added to the start of the string?

> C could be upgraded to do this in future versions, without too much backwards incompatibility.

But I'd hope that doing that would always be optional. There are numerous situations where that would seriously get in the way.

Could you mention one of them?
Strings can point anywhere inside a buffer, not just at its start:

  char buffer[] = "railroad";
  char *s = buffer;
  char *t = buffer + 4;
  printf("mult: %zu\n", strlen(s) * strlen(t));
Suppose I read 100 bytes, formatted as "{name}\t{rank}\t{serial number}\t" using variable-length parts.

I can read the data into a single string buffer, replace the tabs with NULs, and set up strings pointing to the middle of the buffer:

   typedef struct {char buf[101]; char *name; char *rank; char *serialno;} person;

   /* 100 bytes formatted as: name\trank\tserial no\t. */
   int read_data(FILE *f, person *p) {
     char *s;
     if (fread(p->buf, 1, 100, f) != 100) return -1;
     p->buf[100] = 0;
     p->name = p->buf;
     if ((s = strchr(p->buf, '\t')) == NULL) return -2;
     *s = 0;
     p->rank = s+1;
     if ((s = strchr(s+1, '\t')) == NULL) return -2;
     *s = 0;
     p->serialno = s+1;
     if ((s = strchr(s+1, '\t')) == NULL) return -2;
     *s = 0;
     return 0;
   }

   person subject;
   if (read_data(stdin, &subject)) fail("cannot read.");
   printf("Hello %s %s.\n", subject.rank, subject.name);
   ...
Even better, the protocol might already put NUL characters in the data itself, expecting C strings to point at the start of each field.
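A rough sketch of that case, assuming the message is just a run of NUL-terminated fields. split_nul_fields is a made-up helper, not part of any real protocol library. Nothing is copied or rewritten; each field pointer is an ordinary C string inside the received buffer, so there is no place to put a length prefix.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: record where each NUL-terminated field begins in buf.
   Returns the number of complete fields found (at most max_fields).
   Trailing bytes without a terminating NUL are ignored. */
size_t split_nul_fields(const char *buf, size_t size,
                        const char **fields, size_t max_fields) {
    assert(buf != NULL && fields != NULL);
    size_t count = 0;
    size_t start = 0;
    for (size_t i = 0; i < size && count < max_fields; i++) {
        if (buf[i] == '\0') {        /* end of the current field */
            fields[count++] = buf + start;
            start = i + 1;           /* next field begins after the NUL */
        }
    }
    return count;
}
```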
Sure. For instance, there are times when you need to pack strings tightly together. Adding an extra byte or two before the start of the string would get in the way. You could work around it in many cases, but it makes the code uglier and harder to understand/maintain.

One of the things that makes C particularly suitable for certain sorts of tasks is that it's mostly WYSIWYG when it comes to the relationship between data structures and the actual memory layout. Having "hidden" things like a length value before the string steps on that.

I agree on the first paragraph, but the second one applies poorly to strings:

  char *s = "hello";
"hello" occupies 6 bytes because there's a hidden \0, even though I never wrote it in the code.
If you wanted to pack strings together tightly, couldn't your string library have a separate "array" concept where all the sizes are stored separately?
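Something like this is what I have in mind, as a sketch only. packed_strings, packed_len and packed_at are invented names, and storing offsets rather than sizes is my own assumption. The bytes sit back to back with no terminators and no prefixes, and the bookkeeping lives in a side table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical packed-string table: offsets[i] marks where string i starts
   in data, and offsets[count] marks the end of the last string. */
typedef struct {
    const char   *data;
    const size_t *offsets;   /* count + 1 entries */
    size_t        count;
} packed_strings;

/* Length of string i, computed from two adjacent offsets. */
size_t packed_len(const packed_strings *p, size_t i) {
    assert(i < p->count);
    return p->offsets[i + 1] - p->offsets[i];
}

/* Start of string i; the bytes are NOT NUL-terminated. */
const char *packed_at(const packed_strings *p, size_t i) {
    assert(i < p->count);
    return p->data + p->offsets[i];
}
```

The catch is that these are no longer C strings: you would print one with printf("%.*s", (int)packed_len(&p, i), packed_at(&p, i)) rather than a plain %s.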