by hctaw 1861 days ago
I'm reminded of Chesterton's Fence in this.

Every major ABI is listed here as containing the same mistakes. I'm inclined to think the people who designed these ABIs were smart enough to understand the consequences of their design decisions.

I don't know whether this author is correct or not, but my gut is there is something missing here with respect to non local control flow (like exception handling, setjmp/longjmp, and fibers).


I love seeing others bring up Chesterton's fence; it's been a reference that comes to mind with quite a lot of the WTFery I've encountered in my career (usually it remains WTFery even when looking for underlying reasons, but it at least helps remind me to question my instincts).

I don't really know enough to weigh in on this, but I can say that having pursued a lot of WTFish things in my career so far, 90% of the times I've encountered bad decisions, the explanation for it was either "it was done that way because legacy reasons" (i.e., it had to be done that way then, the reason it had to be has changed, and now it would break things to do it 'correctly') or "it was easier" (i.e., at the time the badness wasn't really going to affect anyone, or not measurably, or was very intentional tech debt, and it's only 'now' that anyone is noticing/caring).

I've seen people make bad architectural decisions that now the company is stuck with. And it comes down to just the fact that it was a bad decision, no second guessing needed.

I've also seen "bad" decisions made due to outside constraints. These decisions look like bad decisions, except that if you try to "fix" those decisions, it becomes a lot harder than it looks.

Don't get me wrong, there are plenty of times it was cluelessness. I'm just saying, I find myself going "this is stupid" far more often than it -was- stupid. It might be now, but the reasons for it then sometimes made sense.
In this case, "it was done that way because legacy reasons" is close, but the real answer is "it was done that way because we hadn’t yet invented the parts of compiler theory required to create compilers that enforce this constraint at the type level."
All this compiler sophistication represents a step backwards for binary interfaces. For example, C++ compilers emit such incredible machinery that it's essentially impossible for foreign code to interface with the compiled objects at the binary level. As a result everything eventually gets reduced to the C ABI: simple symbols and calling conventions.
That's... what we're talking about. Simple symbols with calling conventions.

The rules for this proposed ABI are exactly the same as the existing amd64-SystemV C ABI, with one difference: the stack-to-stack copies aren't generated at the call-site; instead, the generated code at the call-site passes the address (in a register, or spilled to stack) for what it would have copied. The compiler generates the stack-to-stack copy in the generated function's prologue, using the address it was passed. Nothing more, nothing less. It's just moving the required location for certain generated code across the linkage, and keeping a temporary alive a little bit longer to make that work. (And in exchange, the temporary that the local stack variable gets put in isn't created at the call-site, so the register-file "pressure" of the change is net neutral.)

This is no more or less complex than the current ABI. It doesn't create more exceptions or edge-cases than the current ABI. It doesn't make the ABI harder to implement. The only thing it does, is choose differently in the matter of a basically-arbitrary choice of where to put some generated glue code (the stack-to-stack copy).

The only practical upshot of this change, is that this enables compilers to sometimes do an optimization that they can't currently do, because doing said optimization would go against the rules of the amd64-SysV ABI (i.e. a caller that pushed a register instead of copying the value wouldn't be an amd64-SysV caller any more, and wouldn't be compatible with precompiled amd64-SysV callees any more; and vice-versa for the callee.)

But if-and-when a compiler does do that optimization, it's internal to the generated function. It doesn't mean that there are two potential callee "signatures" under the proposed ABI. There's only one.

Here's what the proposed ABI would probably say about stack copies:

> "The caller always passes large values by reference; the callee always receives them by reference. If the callee is taking a parameter pass-by-value, then it's up to the compiler of the callee to insert code into the callee's function prologue to turn the passed reference into a stack-local copy of the referenced data."

With that particular legalese, the callee's generated copy is still "required" by the spec, but its effects are now also "hidden" from the caller — i.e. its observable results are no longer leaking across the linkage. Therefore, the compiler is now empowered to optimize out the callee copy, as long as it can ensure the resulting code has observably equivalent results from the caller's perspective.
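
To make the difference concrete, here is a rough C model of the two lowerings. This is only a sketch: `struct big`, its size, and every function name below are made up for illustration, and the explicit `memcpy` calls stand in for the copy a compiler would generate.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical "large" struct; name and size are illustrative only. */
struct big { int v[16]; };

/* Today's lowering, modelled by hand: the callee receives a private copy
   that the CALLER already made, so it may freely scribble on it. */
static int sum_on_private_copy(struct big *private_copy) {
    int s = 0;
    private_copy->v[0] += 1;
    for (int i = 0; i < 16; i++) s += private_copy->v[i];
    return s;
}

static int call_site_today(const struct big *x) {
    struct big tmp;
    memcpy(&tmp, x, sizeof tmp);       /* copy emitted at every call site */
    return sum_on_private_copy(&tmp);
}

/* Proposed lowering: the caller passes a reference, and the callee's
   prologue makes the copy (emitted once, inside the function). */
static int sum_callee_copies(const struct big *ref) {
    struct big local;
    int s = 0;
    memcpy(&local, ref, sizeof local); /* prologue copy */
    local.v[0] += 1;
    for (int i = 0; i < 16; i++) s += local.v[i];
    return s;
}

/* A callee that only reads its argument: here an optimizer could drop
   the prologue copy entirely and read through the reference. */
static int sum_readonly(const struct big *ref) {
    int s = 0;
    for (int i = 0; i < 16; i++) s += ref->v[i];
    return s;
}

static int call_site_proposed(const struct big *x) {
    return sum_callee_copies(x);       /* no copy at the call site */
}
```

In both versions the caller's object is untouched after the call. The only thing that moved is which side of the linkage the `memcpy` sits on.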

Note that this isn't anything the person implementing the ABI targeting code in the compiler has to worry about. They just write the code to generate a callee function prologue that does a stack-to-stack copy. It's the person writing the optimization pass that comes after that codegen step who can now take that stack-to-stack copy and, with a static proof of read-only access by the callee in hand, drop it.

The optimization opportunity being enabled by the change, isn't part of the ABI's spec. The proposed ABI is just about moving the stack-to-stack copy into the callee. What the compiler chooses to do when targeting an ABI where the callee does stack-to-stack copies, is up to the compiler. Presumably, it will do "whatever fiendish things it can" at -O3, and "nothing much different" at -O0. Like usual.

And either way, the linkage itself looks the same. The optimization doesn't change the linkage. Any and all tooling that examines the linkage — debuggers, disassemblers, tracers, etc. — would see the same thing, whether the optimization has occurred or not. Because the optimization isn't part of the linkage; it's internal to the codegen of the callee, enabled by the (uniformly!) modified structure of the linkage.

Yup, there's also time dependence. Perhaps someone wrote some software in COBOL that is hard to maintain now. But rewriting it may not be worth the opportunity cost now, especially for well-tested systems that have been around for a long time and which have critical failure modes. Sometimes it's better to leave things alone and work around them, even if it results in an uglier design.
How about those explanations:

It didn't matter before, as compilers were not optimizing as much, code had a much closer 1:1 correspondence to assembly (if you are passing it by pointer and not register, you would want to make that clear in code).

It's much easier to implement in simple compilers. On the side of the callee you don't have to check if you manipulate your arguments, which is generally hard. Being able to manipulate your arguments is another shortcut for keeping the compiler simple. On the side of the caller you don't have to check if you hand out a mutable pointer.

Also finally and most importantly: memory access was much cheaper in terms of cpu cycles. Just look at cdecl: all parameters are passed on the stack instead of registers. Our current calling conventions stem from performance hacks like fastcall that were only optimizing for existing code (you pass big structs by pointer by convention).

> my gut is there is something missing here with respect to non local control flow (like exception handling, setjmp/longjmp, and fibers)

(Post author.)

Mechanically, what happens is essentially the same as what ms/arm/riscv do: the caller creates a reference and passes it to the callee. The only difference is that the callee is more restricted than it would otherwise have been in what it can do with the memory pointed to by that reference. So I don't think that there can possibly be any implications for non-local control flow.

Doesn't the referenced data have to be guaranteed to outlive the callee, which would only be true if the callee is guaranteed to return to the calling scope?

You can get around the immutability of the reference if your compiler implements the ABI with copy on write semantics, which I think is a reasonable compromise. But I'm still not certain how you would handle arbitrary control flow that the compiler may not be able to reason about.

If, for example, your arguments may be behind const references, how would you implement getcontext/swapcontext for your ABI? If everything is an integral value in registers or on the stack then it's really easy, but I would think it would have to be a compiler intrinsic if it depends on the function signature of the calling context, in order to perform the required copies.

Well for one, the language says a copy is made at the time of the function call, and it's perfectly valid to modify the original before the copy is finished being used. So pretty much any potentially aliasing write or function call in the callee would force a copy, and as he notes C's aliasing rules are lax enough that that's most of them.
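
A small sketch of the kind of function that forces the copy (the struct and all names here are hypothetical, chosen only for illustration). The callee writes to the original object through a global before reading its by-value parameter, and C requires the parameter to still hold the value from the time of the call:

```c
#include <assert.h>

/* Hypothetical large struct and global; names are illustrative only. */
struct big { int v[16]; };

static struct big g_original;

/* The write to g_original may alias the caller's argument, so a compiler
   that passed a hidden reference would have to copy before this store. */
static int read_after_global_write(struct big b) {
    g_original.v[0] = 99;   /* modifies the object the caller passed */
    return b.v[0];          /* must still see the value at call time */
}
```

Calling `read_after_global_write(g_original)` with `g_original.v[0] == 1` has to return 1, not 99. Under the proposed ABI that means the prologue copy stays in for this function.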

Then if you care about the possibility of signal handlers modifying the original... you pretty much have to make a copy every time anyway.

Plus any potential concurrency synchro point existing would force a copy, plus using any unknown function, etc.

Using rust and propagating the single writer xor multiple readers requirement in an ABI, this might be interesting. But with C/C++, I'm afraid copies would be forced "all" the time.

There's still a lot of functions which don't call unknown functions before accessing an argument passed by value, don't take the argument's address, etc. There are many simple functions such as this one:

    void print_foo(FILE *outf, struct foo foo) {
        fprintf(outf, "foo '%s': %i, %i\n", foo.name, foo.x, foo.y);
    }
That one would gain a speed-up and code-bloat reduction from the proposed ABI, and there are many like it.

But even if every single function had to fall back to making a copy, the argument is that there's still a significant code bloat saving by putting the copy in the callee rather than in the caller. After all, the instructions necessary to make a copy take some space, and with the proposed ABI, those instructions are put in the called function, rather than in every function call. Most functions are called more than once, and all functions are called at least once (hopefully), so anything which can be changed from O(number of function calls) to O(number of functions) is an improvement.

Exactly, see my example elsethread. Also in C and derivatives distinct objects are guaranteed to have distinct addresses. Implicit sharing would break this.
It wouldn't. The compiler would just have to generate the copy when the standard demands it (such as if the function body takes the address of the object).
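
Roughly what that looks like, as a sketch with made-up names (`struct big`, `g_original` and both functions are hypothetical). The first function observes the parameter's address, so it needs a real, distinct object; the second never does, so nothing it can observe depends on whether a copy was made:

```c
#include <assert.h>

/* Hypothetical large struct and global; names are illustrative only. */
struct big { int v[16]; };

static struct big g_original;

/* Takes the parameter's address, so the address is observable. The
   parameter is a distinct object and must not compare equal to the
   caller's object, which forces a real copy. */
static int param_aliases_original(struct big b, const struct big *orig) {
    const struct big *p = &b;
    return p == orig;
}

/* Never takes the address and never writes: the kind of function where
   a compiler could skip the copy without any observable difference. */
static int first_element(struct big b) {
    return b.v[0];
}
```

`param_aliases_original(g_original, &g_original)` has to return 0 under any conforming ABI, which is exactly the case where the callee keeps its copy.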
Yes but then in many cases either (or both!) the caller and the callee might need to make a copy defeating the point of the optimization or even being worse than the original.
In many cases the callee would have to make a copy, yes. However:

1. In many cases, no copy would have to be made. There are lots of small non-complex functions out there where the compiler can prove that it's safe to not make a copy.

2. In many other cases, a copy has to be made. But the copy is made by the callee, not by the caller. That means that all the instructions necessary to copy the argument end up in the binary once in the callee, rather than once for every function call, leading to less code bloat (which has its own performance advantages).

In fact, a stupid compiler could just always make a copy without analyzing the function body. This would result in a compiler which generates code that's about as fast as it would be with current ABIs, but with a smaller size.

You have to make a copy on the caller or the callee if the address of the object escapes, so you might end up with two extra copies even if nothing in the program mutates the object.
I don't understand how you achieve extra copies? My understanding is that the caller would never make a copy, it would always pass a pointer to large structs. So the absolute worst case, unless I'm missing something, is that we end up with the same number of copies as we do today (i.e one copy per large struct passed as a parameter).
Sometimes a mistake is a decision made under the assumption that the people who will use the thing are smarter or more careful than they actually are.
> A correctly-specified ABI should pass large structures by immutable reference

is just not possible. CPUs don't know about `const`. So you have to work with the assumption that functions that you call can do anything to their arguments. Thus copies cannot be avoided.

The CPU also doesn't know what an ABI is
An ABI also has a concept of defined and undefined behaviour. You can design an ABI that is fully protected against abuse but often the performance penalty for that will be huge.

Instead what you'll do is specify the constrained inputs and expected output behaviour. From there you can rule out anything that violates those constraints as non-conformant. As long as you maintain those constraints between versions, there's no ABI breakage.

Also you can absolutely have constant references in an ABI. There may be ways of ignoring the const depending on how you design the ABI but they will be obvious abuse.

CPUs actually do know about const; it's called a read-only page.

Besides, that's irrelevant. There's nothing stopping my function from following every pointer on the stack and smashing up its contents; are you going to defend against that, too? If not, how is this any different?