by eqvinox 1115 days ago
The problem with C and buffer overflows isn't that you can't guard against them, or that there is no existing, reusable code to do so — it's that none of this functionality is standardized. Adding another one to the existing 41383 ways of doing this is in fact the exact opposite of what's needed. Ideally C needs one way of doing this, and that would be described in the standard.

But that's not how C "rolls", and we'll never get that. So I guess we now have 41384 ways to do buffer overflow guards.


There is value in actually understanding what someone is doing to protect against buffer overflows, instead of relying on well-established patterns.
Not when I’m trying to orchestrate third party libraries.
C never has just one way to do something. myArr[5] == 5[myArr] == (insert pointer arithmetic that I won't write here without a compiler check). I think that part of C's beauty is that it gives you freedom. Freedom to shoot yourself in the foot, freedom to write hyper efficient code, and freedom to choose another tool.

I agree that this will never be implemented as a standard, but I think that's a good thing. Higher-level languages push against their boundaries nonstop. Java has libraries and frameworks that fundamentally change the syntax and functionality of the language. C knows what it is. If you want something that it can't do, it promises that you can either build it yourself or switch to a different tool.

All of this to say, C has a single suggested way of doing this: using a different language. That's part of why we built them.

Those are syntactic sugar for the same thing though. Array[5] is just shorthand for *(Array + 5), which is why 5[Array] also works (because addition is commutative).
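To make the equivalence concrete, here is a small sketch. The function name `same_element` is made up for illustration; it just compares the addresses that the three spellings produce.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true when all three spellings name the
   same element. `arr` must point to at least i + 1 elements. */
bool same_element(const int *arr, size_t i)
{
    const int *via_subscript = &arr[i];  /* the usual form */
    const int *via_swapped   = &i[arr];  /* operands swapped: *(i + arr) */
    const int *via_pointer   = arr + i;  /* what both forms expand to */
    return via_subscript == via_swapped && via_swapped == via_pointer;
}
```

Calling `same_element(a, 5)` on an `int a[6]` returns true, because the subscript operator is defined in terms of pointer addition.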

Note that C does have strong conventions, such as that strings are terminated by a zero byte. Nothing in the language demands that, it’s just a convention! C could adopt better conventions.

> Note that C does have strong conventions, such as that strings are terminated by a zero byte

Stated the same on HN earlier, but someone pointed out that literal strings are ASCIIZ.

> literal strings are ASCIIZ.

If only. In C, it’s a (95+5)-item character set that happens to be a subset of ASCII. See https://en.cppreference.com/w/c/language/charset:

“The basic literal character set consists of all characters of the basic character set, plus the following control characters”

That page also explicitly says:

The following characters are not in basic execution character set, but they are required to be encoded as a single byte in an ordinary character constant or ordinary string literal.

  Code unit Character Glyph

  U+0024 Dollar Sign $
  U+0040 Commercial At @
  U+0060 Grave Accent `

If I read that correctly, if you write a ‘$’ in a string literal before C23, there’s no guarantee that it gives you a byte with value 0x24.

Of course, C++ is different. Like C, it makes a distinction between the encoding of source files (nowadays called the “basic character set”) and the encoding that the compiler converts literals to (nowadays called the “basic literal character set”), but it seems to put even fewer restrictions on them (in my cursory reading).

Also (https://en.cppreference.com/w/cpp/language/charset):

“Mapping from source file (other than a UTF-8 source file) (since C++23) characters to the basic character set (until C++23) translation character set (since C++23) during translation phase 1 is implementation-defined, so an implementation is required to document how the basic source characters are represented in source files.”

If I understand that correctly, you can’t portably write a euro sign in C++ source files before C++23.

Also, chances are this changed in subtle ways between C and C++ versions.

One common trick in safer C libraries is to encode the length of the string one word prior to the beginning of the string. So "hello world" in memory would be

  11 'h' 'e' 'l' 'l' 'o' ' ' 'w' 'o' 'r' 'l' 'd' '\0'
     ^ ptr

C could be upgraded to do this in future versions, without too much backwards incompatibility.
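A rough sketch of that layout, assuming the length is stored as a `size_t` directly before the text. The names `lp_new`, `lp_len` and `lp_free` are hypothetical and not taken from any particular library.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helpers, for illustration only.
   Memory layout: [size_t length][bytes...]['\0'], caller holds a pointer
   to the bytes. */
char *lp_new(const char *src)
{
    size_t len = strlen(src);
    size_t *block = malloc(sizeof(size_t) + len + 1);
    if (block == NULL)
        return NULL;
    *block = len;                     /* length lives one word before the text */
    char *text = (char *)(block + 1);
    memcpy(text, src, len + 1);       /* copy the terminator too */
    return text;
}

/* O(1) length lookup. Only valid for pointers returned by lp_new. */
size_t lp_len(const char *text)
{
    size_t len;
    memcpy(&len, text - sizeof(size_t), sizeof len);
    return len;
}

/* Release a string created by lp_new. */
void lp_free(char *text)
{
    if (text != NULL)
        free(text - sizeof(size_t));
}
```

Because the terminator is kept, the returned pointer still works with `puts()` and `strlen()`. The catch is the other direction: `lp_len()` reads garbage if it is handed an ordinary `char *` that never had a header written in front of it.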

From the C99 draft at https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf :

"A string is a contiguous sequence of characters terminated by and including the first null character. .. The length of a string is the number of bytes preceding the null character"

This means, for example, strlen() must always check for the location of the first null character - there's no advantage to checking the length.

How would this work?

  void *x = malloc(8);

   ...
  uint64_t i = 5216694956355289088; // Python: int.from_bytes(b'Hello!\0\0')
  memcpy(x, &i, 8);
  char *s = x;
  puts(s);
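Assuming I did it correctly, this should print "Hello!" on a big-endian machine. The constant is the big-endian encoding of those bytes, so on little-endian hardware the first byte in memory is 0 and puts() prints only a newline.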

When would the length get added to the start of the string?

> C could be upgraded to do this in future versions, without too much backwards incompatibility.

But I'd hope that doing that would always be optional. There are numerous situations where that would seriously get in the way.

Could you mention one of them?
My copy of the C standard says "A string is a contiguous sequence of characters terminated by and including the first null character."
Many of the str functions in the C standard library assume a nul terminator.
Yes, but aside from string literals pointed out by a sibling comment, nothing in the language itself dictates this convention. The C library could be augmented with functions which expect strings structured in other ways.
> nothing in the language itself dictates this convention.

String literals are nul-terminated, e.g.: "foo"[3] == '\0'
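A quick way to check this: `sizeof` applied to a literal counts the terminator that the compiler appends. The function names below are made up for the example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: sizeof on a literal includes the appended '\0',
   so "foo" occupies 3 characters + 1 terminator. */
size_t literal_size(void)
{
    return sizeof "foo";
}

/* Hypothetical helper: index 3 is still inside the array, and holds the
   terminator. */
bool literal_is_terminated(void)
{
    return "foo"[3] == '\0';
}
```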

Checked arithmetic has been implemented in the standard (C23) with `stdckdint.h`, so give it 50 more years!
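For anyone who hasn't seen it, here is a minimal sketch of the C23 `<stdckdint.h>` macros applied to a buffer-size calculation. The function `checked_alloc_size` is hypothetical, and the fallback to the GCC/Clang builtins for older toolchains is my own assumption, not something the standard provides.

```c
#include <stdbool.h>
#include <stddef.h>

#if defined(__has_include)
#  if __has_include(<stdckdint.h>)
#    include <stdckdint.h>   /* C23 checked-arithmetic macros */
#  endif
#endif

#ifndef ckd_add
/* Assumption: without the C23 header, fall back to the GCC/Clang builtins,
   which follow the same "true means overflow" contract. */
#  define ckd_add(r, a, b) __builtin_add_overflow((a), (b), (r))
#  define ckd_mul(r, a, b) __builtin_mul_overflow((a), (b), (r))
#endif

/* Hypothetical helper: computes count * elem_size + header into *out.
   Returns false if any step would overflow size_t; *out should not be
   used in that case. */
bool checked_alloc_size(size_t count, size_t elem_size, size_t header,
                        size_t *out)
{
    size_t body;
    if (ckd_mul(&body, count, elem_size))
        return false;
    if (ckd_add(out, body, header))
        return false;
    return true;
}
```

The idea is to reject an oversized request before calling `malloc()`, rather than allocating a wrapped-around, too-small buffer and overflowing it later.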
> Ideally C needs one way of doing this, and that would be described in the standard.

I'm really glad that C doesn't do this, personally. It would reduce one of the main advantages of the language.

> existing, reusable code to do so

Is there a library that you recommend for this?