Hacker News new | ask | show | jobs
by alexey-salmin 29 days ago
Interestingly the article doesn't mention two-dimensional arrays and they're curios because they bring a certain asymmetry with them. It always tripped me over the most in C because I otherwise find the language very "symmetrical". It often feels like in design of this language the beauty of expressing certain things took priority over readability or safety which I admire in a way. But somehow not in the case of the two-dimensional arrays.

If you see a[i][j] it could mean two completely different things:

1) "a" is a continuous chunk of memory of N*M bytes, so it behaves as char*; a[i][j] == *(a + i*M + j)

2) "a" is an array of char* pointers that point to N completely distinct memory chunks of size M, so it behaves as char**; a[i][j] == *(*(a + i) + j)

With flat arrays the difference between an array as a variable and a pointer to the first element is literally negligible because you won't even see the difference in the assembly. This is why the automatic decay-to-pointer makes a lot of sense.

But that breaks completely with multiple dimensions. You definitely see the difference in the assembly because the memory layout is so different.

6 comments

> If you see a[i][j] it could mean two completely different things:

> 1) ... a[i][j] == *((char*)a + i*M + j) // I added the char* cast to make it correct

> 2) ... a[i][j] == *(*(a + i) + j)

You may already understand this but: even in case (1), you still have

   a[i][j] = *(*(a + i) + j)
(It has to - that's what operator[] means in C.)

It's just that, in this case, `a + i` is applying pointer arithmetic to char[M]* so it adds M * i bytes to a's address.

This is similar to how `a + i`, if a is int32_t*, will give you an address 4 * i bytes bigger than a.

Really the confusing part of this is that *(a + i), which is an array value i.e. has type char[M], decays to char* when you add an integer to it (or dereference it). This is a pretty crazy hack really. Imagine if, in C++, you could do this

   std::vector<int> v = {1, 2, 3};
   int* x = v + 1;   // equivalent to &v[1]
Yuck.
Too late to edit but I wrote pointer to char[M] as char[M]* when, of course(!), it should be written as char(*)[M].
"breaks completely"

I rather would say it works nicely in auto-generating the complex indexing operation for n-dimensional arrays which makes it a lot more convenient and less error-prone to write such code. The compiler may also flatten a loop.

The array of pointer hack used previously to similate 2d arrays using an array to pointers to arrays should not be used outside of special algorithms, as it is error prone and slow.

> The compiler may also flatten a loop.

http://c2.com/cgi/wiki?SufficientlySmartCompiler

In practice, C compilers are still notoriously bad at loop optimizations.

Polyhedral optimizations provided some hope, but no compiler managed to adopt it in production.

Maybe, but also irrelevant to the discussion because whether you write mat[b * A + a] by hand or mat[b][a] and let the compiler frontend expand then makes no difference to the optimizer.
You missed the point.

Those two representations are equivalent, yes. But that's not what flattening loops mean.

You are telling me what my point was?
As I recall, C# supports this in a completely sensible way by distinguishing a[i,j] and a[i][j]. If I understand right, in C, a[i][j] means what C# would spell a[i,j], which does seem rather surprising and inconsistent
Not quite. As GP mentions, a[i][j] might mean either, depending on what the type of a is:

(a) If the type of a is “array of length N of pointer to (say) char” (declaration: char *a[N]), then a[i][j] means the jth char in the contiguous block pointed to by the ith pointer. In C#, this is what you get with an array of arrays.

(b) If the type of a is “array of length N of array of length M of char” (declaration: char a[N][M] — sic!), then a[i][j] means the jth element of the ith element, aka the (i*M+j)th char in the single contiguous memory block. In C#, this is what you get with a two-dimensional array.

The way this happens is a bit subtle:

(a) The value a, of type “array of size N of pointer to char”, first decays into “pointer to pointer to char”, then a[i] retrieves the ith “pointer to char” starting from it as a base, then in turn a[i][j] retrieves the jth “char” starting from that as a base.

(b) The value a, of type “array of length N of array of length M of char”, first decays into “pointer to array of length M of char” (sic!), then a[i] retrieves the ith “array of length M of char” starting from it as a base, which then decays into “pointer to char”, then a[i][j] retrieves the jth “char” starting from that as a base.

NB: There are no implicit references here, unlike in C#; in part (b), a is an N*M-byte chunk of memory and a[i] is an M-byte piece of it.

In C, a[i][j] can mean either a[i,j] or a[i][j], depending on the type of a.
And just in case you have not come across this, C++ allows you you overload all the relevant operators here: [], *, ->

So, you really can't tell what's going on behind the scenes.

I wanted to pull my hair out seeing some 'enterprise' code use

  state[i] = foo;
for some kind of logging where i was the severity level. There were even instances of state[i++], where the severity was incremental. I hope someone has rewritten that codebase with AI by now.
So you would be equally critical of overloading [] for maps?

Sorry, hard for me to relate, as I've overloaded [] (in, say, Python) to make life easy on everyone. People loved it.

I hope you're aware that there is a long standing debate on whether overloading operators is good/bad, and it comes down to personal preference?

> So you would be equally critical of overloading [] for maps?

No, I'm not sure how you got that impression. Overloading is great.

It's also confusing when it does something completely different from what you intuitively expect.

If you're an old timer, you expect [] to index into an array, and you definitely do not expect it to do a lookup in a map/dict. Older languages just didn't include a dict-like structure in the language (technically, even C++ didn't).

Do you not see that for them, overloading [] for dictionary lookups is "something completely different from what you intuitively expect"?

More strikingly, C++ doesn't distinguish what Rust would call IndexMut and just Index, the use of the [idx] operator in a context where we'll mutate things and one where we don't want that.

Rust's HashMap implements Index because answer = map[name] is a perfectly reasonable thing to do, and if there is no key matching name then we panic, makes sense but it does not provide IndexMut so that you could write map[name] = answer because the edge cases are non-obvious so better to make you write what you meant.

C++ hash tables implement operator[] but the result is mutable, in order that map[name] = 123.0 can work which in Rust would be IndexMut and isn't provided. Because this is true, the index operation always succeeds, if you ask for map["not-present"] it creates the hash table entry for "not-present" and tries to store a default value ready to update it if you later assign to the reference.

For 1), you can just write (&a[i])[j] .
I mean, just like with 1 dimensional arrays, it depends on the context.

Array memory is on the stack. The size of that array is actually not known at run time, its only known at compile time, where any reference to that length gets resolved by the compiled.

If your 2d array sits on the stack, then inferring memory layout is pretty easy. If you are dealing with pointer that was passed to a function, then you can't assume anything about data size or limits, which is why many functions that take pointers take a size parameter as well.

> If your 2d array sits on the stack, then inferring memory layout is pretty easy. If you are dealing with pointer that was passed to a function, then you can't assume anything about data size or limits, which is why many functions that take pointers take a size parameter as well.

Right, but 2d arrays come into this picture with their own quirks again. You're not just passing the size as the parameter, you can pass it as a "special" parameter that influences how the compiler will interpret other parameters. E.g. in C99 you can do this:

    void do(size_t x, size_t y, int a[][y]);
Here "y" plays the critical role because it will be used to compute offsets in the a[i][j] expression. For 1d arrays this doesn't happen.

Of course it's still generalizable as "all but the outermost dimensions should be known" and for 1d array the outermost dimension is the only dimension. Still, this whole thing always felt a bit odd to me.

Well, you give the explanation yourself: The size for the outermost array is not always needed, and then C allows it to be omitted.

But my recommendation is to always give the size and then everything is regular and the compiler can use the information for warnings.

> Array memory is on the stack.

Array memory can sit on either the stack or the heap.

> The size of that array is actually not known at run time, its only known at compile time, where any reference to that length gets resolved by the compiled.

This is also a bit misleading, in two ways. First, it's not clear what you mean by "size" here - the size of the memory block(s), or the shape of the array?

Second, many people think that the C runtime doesn't know the amount of memory allocated to an array, but this is actually false. It's just the C abstract model that for some reason chose to not expose this information - but the size is actually always stored and accessible, and this is virtually mandated by the standard: otherwise, `free(arr)` couldn't realistically work, it would have to be `free(arr, size)`. This is one of the weirdest inefficiencies of C, in fact - it requires you to store the size of arrays twice - once in user code, and another time in the internal logic of the allocator.

Edit: and as a fun extra, C++ not only inherited this mistake from C, but reproduced it again, meaning that a C++ array allocated with new[] actually stores the size twice, at least with typical implementations - once in the C++ runtime and again in the allocator - and still requires the user-space code to store it a third time. This is because `delete[]` needs to call the destructors of all of the elements of the array, regardless of where and how the array was allocated, so the number of array elements needs to be stored alongside the object itself.

>Array memory can sit on either the stack or the heap.

No, if we are using the definition of an array that is like int c[] = ..., that is always going to be on the stack. Heap continuous memory =/= array. You can use the [] operator to access it like an array, but fundamentally, as far as structures in C language are concerned, those 2 are different, because they get treated by compiler differently.

>but the size is actually always stored and accessible, and this is virtually mandated by the standard: otherwise, `free(arr)` couldn't realistically work,

That would only be true if each element in the array was a char.

The dynamic data structure stores total amount of memory allocated by address, it has no info about the size of the element, so it can't infer the actual number of items at runtime. You could write your own malloc that does this, but generally, that is left to the user for flexibility. For example, a really good practice in C coding that basically solves any double free is a mempool that allocates all the memory up front. That way, you never really even have to call free, and the memory you allocate can be partitioned any way you chose dynamically.

> that is always going to be on the stack.

Unless your C implementation doesn't use a stack for data storage. Which existed, you know: IIRC the C compilers for Cray machines used linked lists to hold activation frames. And of course, there are PIC microcontrollers where you can't really use the hardware stack for anything except return addresses.

> int c[] = ..., that is always going to be on the stack

Why? In the following code, only c will be allocated on the stack:

  int a[]={1,2,3};
  foo() {
    static int b[]={1,2,3};
    int c[]={1,2,3};
  }
The point is anything that is not dynamically allocated has size known at compile time.
> No, if we are using the definition of an array that is like int c[] = ..., that is always going to be on the stack. Heap continuous memory =/= array. You can use the [] operator to access it like an array, but fundamentally, as far as structures in C language are concerned, those 2 are different, because they get treated by compiler differently.

Well, not necessarily. For one thing, if we have a function foo(int c[]), it's debatable if c is an array variable or a pointer variable. However, what's not debatable is that you can allocate a struct on the heap, and that struct can have an array member - e.g. `struct foo { int a[10]; }; [...] struct foo *x = malloc(sizeof(struct foo));` would allocate an array on the heap as part of the struct.

> That would only be true if each element in the array was a char.

That's why I said that it depends on what exactly you mean by the size of the array. It's also true that in today's world at least, malloc() will often allocate more memory than you actually ask for, to optimize against fragmentation - and then the internally stored size is the size of the actual allocation, not the logical size that you requested - which may not even fit into a whole number of array elements. So, I was being a little overly simplistic (lying) for dramatic effect.

> For example, a really good practice in C coding that basically solves any double free is a mempool that allocates all the memory up front.

While this is a very valid technique for certain purposes, especially when dynamic allocation is needed in very high performance code, it's very much not a valid solution for memory safety - quite the contrary, it's a terrible practice for that. In particular, this is almost exactly the issue that caused the infamous HeartBleed vulnerability in OpenSSL to stay hidden for so long: the use of a memory pool for the buffers used to store TLS packets was hiding the buffer overflow from UBSan and valgrind and similar tools, since the reads were perfectly valid from a language perspective (they were never reading from free()d/unallocated memory, only from memory that had been released to the memory pool).

Its all about what the compiler sees.

Structs are a defined type, which means its construction (and therefore total size) has to be known , the array definition with size is necessarily part of that struct type. So anytime that struct is used, the compiler needs to see its definition, and thus can safely infer the size. Thats pretty much the whole reason structs are a thing - the very basic type that allows you to pass around data format during the compilation process.

Arrays are not defined as types in C, they are really at most just syntax convenience. So if another function takes an array as a parameter, and it gets compiled as part of a file, there is no way for the compiler to auto infer what would get passed into it.

Char allocation usually involves +1 bytes for null terminated strings, which is used as a signal for allocated memory. So strlen(char *) is accurate.

>quite the contrary, it's a terrible practice for that. In particular, this is almost exactly the issue that caused the infamous HeartBleed vulnerability in OpenSSL to stay hidden for so long: the use of a memory pool for the buffers used to store TLS packets

The heartbleed vulnerability was not due to mempool. It was due to a combination of lack of bounds checking, and not zeroing out the memory containing secure keys when its deallocated. Even if it didn't use mempool, leaks would still be possible.

Even for char*, it's very possible that malloc() will store more memory than strictly required. But you're right, `char x[] = "abc"` will require a minimum of 4 bytes wherever x gets allocated (stack or global segment).

> The heartbleed vulnerability was not due to mempool. It was due to a combination of lack of bounds checking, and not zeroing out the memory containing secure keys when its deallocated. Even if it didn't use mempool, leaks would still be possible.

I didn't say that the bug was caused by the mempool, I said that the bug was very hard to find by regular tools such as valgrind and UBSan because it used mempools instead of regular allocations - so that all of the logical out of bounds accesses were not actually UB nor were they accessing unallocated memory, which those tools could have caught.

> Second, many people think that the C runtime doesn't know the amount of memory allocated to an array, but this is actually false. It's just the C abstract model that for some reason chose to not expose this information.

There are some counterpoints:

1) Conceptually, allocated memory block and data structure / array in it are not related. You can allocate memory block and then subdivide it to multiple different structures / arrays. You can implement sub-allocators.

2) Heap allocator does not need to store exact length of allocated object. For example, it could have several fixed-length slab allocators for smaller objects, select matching one during malloc() and use address range to find slab during free().

3) Array can be also on the stack (VLA or alloca()).

4) Arrays can be also on memory allocated outside of C library allocator (e.g. mmap()).

All are fair points, I was being a bit cavalier with the facts. I'll also add that many if not all modern malloc() implementations actually allocate somewhat larger amounts of memory than your request, to respect various alignment requirements and/or to avoid excessive fragmentation - even when not using pure slab allocations.

I do think the C++ bookkeeping from new[]/delete[] however has few if any similar caveats - the runtime really needs exactly the kind of information you also need in your code; the only caveat I can imagine is that it might omit this information for types that don't need destruction, such as `int`, but I don't know if this is a plausible optimization in realistic use cases that are not trivial.