by matzf 1698 days ago
This is not the point of the article, and maybe I'm just tired, but I'm confused; multiple paragraphs mention increased stack size with inlining:

> When a C compiler decides to not inline, there is likely a good reason. For example, inlining would reuse a register which require to save/restore the register value on the stack and so increase the stack memory usage or be less efficient.

> On the other side, the Py_NO_INLINE macro can be used to disable inlining. It is useful to reduce the stack memory usage.

This seems completely backwards, what are they talking about?


I think their reasoning appears earlier:

> When a C compiler decides to not inline, there is likely a good reason. For example, inlining would reuse a register which require to save/restore the register value on the stack and so increase the stack memory usage or be less efficient.

And ... that removes all doubt. They are wrong. If a calculation requires an extra register, doing a function call won't conjure it out of thin air. It has to spill the register too, and it will push the PC along with it.

It's still possible that doing a function call rather than inlining will speed up code on modern CPUs. The repeated code that inlining generates puts extra demands on the caches.

You might not be considering the big picture. If the extra register is used in a callee that wasn't inlined, that slot on the stack will be free to be reused by other callees that aren't inlined. So inlining into a function with multiple callees can increase the peak stack usage for the entire callgraph.
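To put rough numbers on that, here is a toy model in C. The frame sizes are made up, and it assumes the inlined callee's slots stay reserved in the caller's frame while the other callee runs; return addresses and saved registers are ignored.

```c
#include <stddef.h>

/* Hypothetical frame sizes in bytes, chosen only for illustration. */
enum { CALLER_FRAME = 64, CALLEE_X_FRAME = 32, CALLEE_Y_FRAME = 48 };

static size_t max_size(size_t a, size_t b) { return a > b ? a : b; }

/* Neither callee is inlined: x and y run one after the other,
 * so their frames occupy the same region below the caller's frame. */
static size_t peak_no_inline(void) {
    return CALLER_FRAME + max_size(CALLEE_X_FRAME, CALLEE_Y_FRAME);
}

/* x is inlined into the caller and (by assumption) its slots are not
 * released before y is called, so y's frame stacks on top of both. */
static size_t peak_x_inlined(void) {
    return (CALLER_FRAME + CALLEE_X_FRAME) + CALLEE_Y_FRAME;
}
```

With these numbers the peak is 112 bytes when nothing is inlined and 144 bytes when x is inlined, which is the effect described above. A real compiler may do better or worse than this model.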
I guess the `flatten` attribute could be used here to help the situation?

    flatten

      Generally, inlining into a function is limited.

      For a function marked with this attribute, every call 
      inside this function is inlined, if possible.

      Functions declared with attribute noinline and similar are not inlined.

      Whether the function itself is considered for inlining depends on 
      its size and the current inlining parameters.
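
For reference, a minimal sketch of how the attribute is spelled, assuming GCC or Clang. The function names are hypothetical, and this only shows the syntax; whether `flatten` helps or hurts stack usage is a separate question.

```c
/* Assumes GCC or Clang; other compilers may not accept these attributes. */

/* Hypothetical helper, eligible for inlining into a flatten caller. */
static int square(int x) { return x * x; }

/* noinline: per the documentation quoted above, this stays a real call
 * even when the caller is marked flatten. */
__attribute__((noinline)) static int checked_add(int a, int b) {
    return a + b;
}

/* flatten: ask the compiler to inline every call in this body, if possible. */
__attribute__((flatten)) int sum_of_squares(int a, int b) {
    return checked_add(square(a), square(b));
}
```

Calling `sum_of_squares(3, 4)` returns 25 either way; the attributes only affect how the calls are compiled.
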
Suppose you have a function `a` that calls two other functions in succession, `b` and `c`. Further suppose `b` and `c` each have one large stack variable (e.g. an array). Ideally we should be able to reuse the same stack memory for both variables, since they aren't in use at the same time.

There are three cases: neither `b` nor `c` is inlined; both are inlined; or only one is inlined. If neither is inlined, the memory is always reused. If both are inlined, the memory is usually reused but not always. If only one is inlined, the memory is never reused.

Therefore, marking functions no-inline can indeed reduce stack usage, but it depends on the situation.

Details:

Case 1: Neither `b` nor `c` is inlined. Then `b` will push its stack frame and pop it when it's done, then `c` will do the same with its stack frame, reusing the same memory.

Case 2: Both `b` and `c` are inlined. Then both of their variables will be part of `a`'s stack frame. A naive compiler would put each variable in a separate location in the stack frame, so the size of the stack frame would be at least the sum of `b` and `c`'s variables, wasting stack space. However, most compilers can determine that the variables' lifetimes don't overlap and reuse the same part of the stack frame for both. (LLVM calls this "stack coloring", for reference.)

Most compilers, but not all. In a simple test on gcc.godbolt.org, GCC, MSVC, and Clang all normally perform this optimization at all optimization levels [1]. But ICC (Intel C compiler) fails to perform it, allocating space for both variables even at -O3. And there are many more obscure C compilers (not that commonly used these days, but they exist), some of which presumably have the same problem.

Case 3: One of `b` and `c` is inlined but the other isn't. Suppose `b` is the one inlined. `b`'s variable will be incorporated into `a`'s stack frame, but when `a` then calls `c`, `c` will push its stack frame on top of `a`'s. In theory, `a` could dynamically reduce the size of its stack frame before calling `c`, but as far as I know, no major compilers do this, regardless of the optimization level. Therefore, the memory cannot be reused.

[1] Test case: https://gcc.godbolt.org/z/nY4ddz7q1

Note: Clang actually doesn't perform the optimization at -O0, but at -O0 functions are never inlined unless forced to be with always_inline, which should be used sparingly. So it's not a concern in typical situations. MSVC, for its part, doesn't inline functions at /O0 even if they are marked __forceinline.
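
A rough C sketch of the setup with both callees forced out of line (case 1), not the exact code behind the link. `NO_INLINE` and `BUF_SIZE` are names made up for this sketch, and the attribute spelling assumes GCC or Clang.

```c
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 4096  /* arbitrary size for the sketch */

/* GCC/Clang spelling; a stand-in for a macro like Py_NO_INLINE,
 * not its actual definition. */
#define NO_INLINE __attribute__((noinline))

/* b and c each own one large stack array. */
NO_INLINE static size_t b(const char *s) {
    char buf[BUF_SIZE];
    strncpy(buf, s, BUF_SIZE - 1);   /* copy at most BUF_SIZE - 1 bytes */
    buf[BUF_SIZE - 1] = '\0';        /* guarantee termination */
    return strlen(buf);
}

NO_INLINE static size_t c(const char *s) {
    char buf[BUF_SIZE];
    size_t n = strlen(s);
    if (n > BUF_SIZE - 1) n = BUF_SIZE - 1;
    memset(buf, 0, sizeof buf);      /* zero the whole array first */
    memcpy(buf, s, n);
    return strlen(buf);
}

/* a calls b, then c; b's frame is gone before c's frame is pushed,
 * so a's own frame never has to hold either array. */
size_t a(const char *s) {
    size_t first = b(s);
    size_t second = c(s);
    return first + second;
}
```

Dropping `NO_INLINE` from one or both functions gives the other two cases. With GCC, compiling with `-fstack-usage` writes per-function frame sizes to a `.su` file, which is one way to compare them.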

Reading between the lines, they're probably pointing out that inlining non-trivial functions increases register pressure which leads to larger stack frames as variables spill onto the stack. That's a bit of a strawman since calling the non-inlined function will need to build an entire new stack frame anyway. Which one is most beneficial depends on other circumstances like cache behavior and how often the code is called.
So this one is counterintuitive. You only spend stack space for local variables that don’t fit in registers. If you inline a function with lots of local variables into another function with lots of local variables, the combined function uses more stack space. And then that additional stack space can’t be reclaimed until the function it was inlined into returns. So it’s more stack space used temporarily but less used on average.
The same (and more) stack space will be used if it was called as a regular function, whether that's preserving registers across the call boundary or simply running out of registers. So this shouldn't save any space, temporarily or otherwise.
We’re not at all disagreeing. Calling vs inlining trades higher peak stack usage for (ideally) lower average stack usage.

A long-running function that calls a short-lived but memory-hungry function is a bad candidate for inlining because of that.